Computing Reviews, the leading online review service for computing literature.

Modeling and managing changes in text databases
Ipeirotis P., Ntoulas A., Cho J., Gravano L. ACM Transactions on Database Systems 32(3): 14-es, 2007. Type: Article

Date Reviewed: Dec 20 2007

Web-accessible text databases contain large quantities of textual information, which users may search using general-purpose search engines or other specialized tools. Text documents are represented by content summaries, which are generated and stored on large servers. To satisfy an information need, only the content summaries are consulted, without accessing the source databases.

A common, optimistic assumption is that these databases are static or change rarely. Under this assumption, the only way to respond to changes is to update the content summaries periodically so that they keep reflecting the documents stored in the source databases. Since periodic updates are costly, more sophisticated tools are required to predict when a content summary update should be performed.

To attack this problem, the authors apply well-known statistical methods, specifically proportional hazards regression, a survival analysis tool. With this method, update schedules can be defined so that the source databases are contacted only when necessary, avoiding unnecessary periodic updates of content summaries.

The effectiveness of the proposed scheme is evaluated in experiments with real-life data. The evolution of 152 text databases was tracked over a period of 52 weeks and used as a basis to predict changes in the corresponding content summaries, and therefore to schedule when to contact the source databases and update the summaries.
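
To make the scheduling idea concrete, here is a minimal Python sketch of how a proportional hazards model could turn per-database features into an update interval. It is an illustration under stated assumptions, not the authors' implementation: the constant baseline hazard, the coefficient values, the single feature, and all function names are hypothetical.

```python
import math

def hazard_rate(baseline_rate, coefficients, features):
    """Proportional hazards: the baseline rate is scaled by exp(beta . x).

    Assumption: the baseline hazard is constant over time (an exponential
    model), which is a simplification chosen for this sketch.
    """
    linear_term = sum(b * x for b, x in zip(coefficients, features))
    return baseline_rate * math.exp(linear_term)

def survival_probability(rate, weeks):
    """Probability that a content summary is still current after `weeks`."""
    return math.exp(-rate * weeks)

def weeks_until_update(rate, min_survival):
    """Largest whole number of weeks for which survival stays >= min_survival
    (never less than one week)."""
    return max(1, math.floor(-math.log(min_survival) / rate))

# Hypothetical numbers: one feature describing how volatile a database is.
BASELINE_RATE = 0.05   # assumed weekly change rate of a "reference" database
COEFFICIENTS = [0.8]   # assumed regression coefficient, not a fitted value

volatile_rate = hazard_rate(BASELINE_RATE, COEFFICIENTS, [1.0])
stable_rate = hazard_rate(BASELINE_RATE, COEFFICIENTS, [0.0])

# Contact each database once its summary is more likely stale than current.
volatile_interval = weeks_until_update(volatile_rate, min_survival=0.5)
stable_interval = weeks_until_update(stable_rate, min_survival=0.5)
print(volatile_interval, stable_interval)
```

Under these assumed values, the volatile database is scheduled for a refresh about twice as often as the stable one, which is the behavior a fixed period for all databases cannot provide.
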
Four methods were compared: the naive method, which updates content summaries periodically, every T weeks; a machine learning method, which treats scheduling as a binary classification problem; a sampling-based method, which samples the source databases and estimates the fraction of documents that have changed; and the proposed survival analysis method. The results show that the proposed method outperforms the alternatives in terms of both recall and precision.
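
As a rough illustration of how such schedules might be scored, the Python sketch below compares a naive every-T-weeks schedule with a targeted one on a toy change history. The definitions of precision and recall used here, the `tolerance` parameter, and the data are assumptions made for this example; the paper's exact definitions may differ.

```python
def naive_schedule(period, horizon):
    """Weeks at which the naive method updates: every `period` weeks."""
    return list(range(period, horizon + 1, period))

def evaluate_schedule(change_weeks, update_weeks, tolerance=1):
    """Score an update schedule against the weeks in which a summary changed.

    Assumed definitions (illustrative only):
      precision = fraction of updates that found at least one change
                  since the previous update;
      recall    = fraction of changes followed by an update within
                  `tolerance` weeks.
    Weeks are numbered from 1.
    """
    changes = sorted(change_weeks)
    updates = sorted(update_weeks)

    useful_updates = 0
    previous_update = 0
    for update in updates:
        if any(previous_update < change <= update for change in changes):
            useful_updates += 1
        previous_update = update
    precision = useful_updates / len(updates) if updates else 0.0

    caught = sum(
        1 for change in changes
        if any(change <= update <= change + tolerance for update in updates)
    )
    recall = caught / len(changes) if changes else 0.0
    return precision, recall

# Toy history over 52 weeks (made-up data): the summary changed three times.
changes = [10, 30, 31]
naive_scores = evaluate_schedule(changes, naive_schedule(period=4, horizon=52))
targeted_scores = evaluate_schedule(changes, [10, 31])
print(naive_scores, targeted_scores)
```

In this toy case the naive schedule contacts the database 13 times yet only two of those contacts find a change, while a schedule that predicts the change weeks well needs two contacts. This is the kind of gap that motivates predicting update times rather than fixing a period.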

Reviewer: Apostolos Papadopoulos	Review #: CR135043

Textual Databases (H.2.4 ... )

Distributed Databases (H.2.4 ... )

Indexing Methods (H.3.1 ... )

Information Filtering (H.3.3 ... )

Information Networks (H.3.4 ... )

Large Text Archives (H.3.6 ... )

Reproduction in whole or in part without permission is prohibited. Copyright 1999-2024 ThinkLoud®