Web-accessible text databases contain large quantities of textual information. Users may search this information using general-purpose search engines or other specialized tools. The documents in these databases are represented by content summaries, which are generated and stored on large servers. As a result, an information need can be satisfied by consulting only the content summaries, without accessing the source databases.
A common, optimistic assumption is that these databases are static or change rarely. Under this assumption, the only way to respond to changes is to update the content summaries periodically. Since periodic updates are costly, more sophisticated tools are required to predict when content summary updates should be performed. This synchronization is needed so that the content summaries continue to reflect the content of the documents stored in the source databases.
To attack this problem, the authors apply well-known statistical methods. Specifically, they use proportional hazards regression, a survival analysis tool. With this method, update schedules can be defined so that the source databases are contacted only when necessary, avoiding unnecessary periodic updates of the content summaries.
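To make the idea concrete, the following minimal Python sketch shows how a proportional hazards model could drive an update schedule. It is an illustration under stated assumptions, not the authors' implementation: the baseline hazard is assumed to be constant (an exponential baseline), and the function names, coefficients, features, and the threshold rule are all hypothetical.

```python
import math


def relative_risk(coefficients, features):
    """Proportional hazards multiplier exp(beta . x) for one database."""
    return math.exp(sum(b * x for b, x in zip(coefficients, features)))


def survival_probability(t_weeks, baseline_rate, coefficients, features):
    """Probability that the summary is still up to date after t_weeks.

    Assumption: constant baseline hazard, so S(t | x) = exp(-rate * risk * t).
    """
    risk = relative_risk(coefficients, features)
    return math.exp(-baseline_rate * risk * t_weeks)


def weeks_until_update(change_threshold, baseline_rate, coefficients, features):
    """Time at which P(summary is outdated) = 1 - S(t | x) reaches the threshold.

    The threshold rule is a hypothetical scheduling policy for illustration.
    """
    if not 0.0 < change_threshold < 1.0:
        raise ValueError("change_threshold must be strictly between 0 and 1")
    risk = relative_risk(coefficients, features)
    return -math.log(1.0 - change_threshold) / (baseline_rate * risk)


# Hypothetical database whose single feature has no effect (coefficient 0):
# the update is due after ln(2) / 0.1 weeks, about 6.93 weeks.
print(weeks_until_update(0.5, 0.1, [0.0], [1.0]))
```

A database with a larger relative risk reaches the same threshold sooner, so it is contacted more often, while a slowly changing database is left alone for longer.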
The effectiveness of the proposed scheme is evaluated in experiments with real-life data. The experiments covered 152 text databases, whose evolution was tracked over a period of 52 weeks. The evolution of these databases was used as a basis to predict changes in the corresponding content summaries, and therefore to schedule when to contact the source databases and update the summaries. Four methods were compared: the naive method, which applies periodic updates of content summaries every T weeks; the machine learning method, which treats the scheduling as a binary classification problem; the sampling-based method, which samples the source databases and estimates the fraction of the documents that have changed; and the proposed survival analysis method. The results show that the proposed method outperforms the alternatives in terms of both recall and precision.
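Two of the baselines described above are simple enough to sketch in a few lines of Python. This is only an illustration under stated assumptions: each database snapshot is assumed to be a dictionary mapping a document identifier to a content hash, and the function names and parameters are hypothetical rather than taken from the paper.

```python
import random


def naive_update_weeks(period_weeks, horizon_weeks):
    """Naive method: update the summary every period_weeks (the T above)."""
    return list(range(period_weeks, horizon_weeks + 1, period_weeks))


def estimate_changed_fraction(old_docs, new_docs, sample_size, seed=0):
    """Sampling-based method: estimate the fraction of changed documents.

    old_docs and new_docs map document id -> content hash (an assumption).
    A document missing from new_docs is counted as changed.
    """
    doc_ids = sorted(old_docs)
    if not doc_ids:
        return 0.0
    rng = random.Random(seed)  # fixed seed keeps the sketch reproducible
    sample = rng.sample(doc_ids, min(sample_size, len(doc_ids)))
    changed = sum(1 for d in sample if new_docs.get(d) != old_docs[d])
    return changed / len(sample)


# Updates every 13 weeks over the 52-week observation period.
print(naive_update_weeks(13, 52))
```

A scheduler built on the second function would trigger a summary update only when the estimated fraction exceeds some chosen threshold, at the cost of contacting the database to draw the sample.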