Computing Reviews
Modeling and managing changes in text databases
Ipeirotis P., Ntoulas A., Cho J., Gravano L. ACM Transactions on Database Systems 32(3): 14-es, 2007. Type: Article
Date Reviewed: Dec 20, 2007

Web-accessible text databases contain large quantities of textual information, which users can search through general-purpose search engines or other specialized tools. The content of each database is represented by a content summary, generated and stored on large servers. To satisfy an information need, only these content summaries are consulted, without accessing the source databases directly.
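
To picture this setup, here is a minimal sketch of my own (not from the paper): a content summary is treated, roughly, as word-level document-frequency statistics for a database, and a metasearcher scores candidate databases against a query using only those statistics. All names and data below are illustrative.

    # Illustrative sketch: content summaries as word -> document-frequency maps,
    # and query routing that consults only the summaries, never the databases.
    from collections import Counter

    def build_summary(documents):
        """Document frequency of each word across a (sampled) document set."""
        df = Counter()
        for doc in documents:
            df.update(set(doc.lower().split()))
        return df

    # Hypothetical summaries for two web-accessible text databases.
    summaries = {
        "medline-like": build_summary(["cancer therapy trial", "gene expression cancer"]),
        "news-like":    build_summary(["election results today", "stock market news"]),
    }

    def score(summary, query):
        """Naive database-selection score: sum of query-word document frequencies."""
        return sum(summary.get(w, 0) for w in query.lower().split())

    query = "cancer trial"
    best = max(summaries, key=lambda name: score(summaries[name], query))
    print(best)  # the metasearcher would forward the query only to this database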

A common, optimistic assumption is that these databases are static or change only rarely. Under this assumption, the only way to respond to changes is to update the content summaries periodically. Because periodic updates are costly, more sophisticated tools are needed to predict when a content summary update should be performed, so that the summaries stay synchronized with the content of the documents stored in the source databases.

To attack this problem, the authors apply well-known statistical methods. Specifically, they use Cox proportional hazards regression, a survival analysis technique. With this method, update schedules can be defined so that a source database is contacted only when necessary, avoiding unneeded periodic refreshes of its content summary.
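
As a concrete, hedged illustration of the survival analysis idea (my sketch, not the authors' code; it relies on the third-party lifelines library, and the features and numbers are invented), a proportional hazards model can be fit on how long each tracked database took to change substantially, and a new database's next update can be scheduled for when its predicted survival probability drops below a freshness threshold:

    # Sketch: schedule content summary updates with Cox proportional hazards.
    import pandas as pd
    from lifelines import CoxPHFitter

    # Each row describes one observed database: weeks until its content changed
    # beyond a threshold, whether that change was actually observed (event=1) or
    # the observation was censored (event=0), and illustrative predictive features.
    history = pd.DataFrame({
        "weeks_until_change": [3, 10, 7, 52, 21, 5, 30, 14],
        "event":              [1, 1, 1, 0, 1, 1, 0, 1],
        "log_num_docs":       [8.2, 11.5, 9.1, 13.0, 10.4, 9.8, 12.1, 10.9],
        "past_change_rate":   [0.40, 0.05, 0.22, 0.01, 0.10, 0.18, 0.04, 0.12],
    })

    cph = CoxPHFitter()
    cph.fit(history, duration_col="weeks_until_change", event_col="event")

    # Survival curve for a new database: S(t) = probability its summary is still
    # fresh after t weeks. Schedule the next contact when S(t) falls below a
    # chosen threshold (0.5 here); a real scheduler would also handle the case
    # where S(t) never falls below the threshold in the observed horizon.
    new_db = pd.DataFrame({"log_num_docs": [10.0], "past_change_rate": [0.15]})
    surv = cph.predict_survival_function(new_db)
    next_update_week = (surv[surv.columns[0]] < 0.5).idxmax()
    print(f"Contact this database around week {next_update_week}")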

The effectiveness of the proposed scheme is evaluated through experiments with real-life data: 152 text databases were tracked over a period of 52 weeks. The observed evolution of these databases was used to predict changes in the corresponding content summaries, and thus to schedule when to contact the source databases and refresh the summaries. The proposed survival analysis method was compared against three alternatives: a naive method, which updates every content summary every T weeks; a machine learning method, which treats scheduling as a binary classification problem; and a sampling-based method, which samples the source databases to estimate the fraction of documents that have changed. The results show that the proposed method outperforms the alternatives in both recall and precision.
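
To make the comparison concrete, here is a toy scoring sketch of my own (the paper's exact precision and recall definitions may differ): treat each week as a decision, mark the weeks in which a database actually needed a refresh and the weeks in which a policy updated it, and compute precision and recall over those decisions. The change pattern and the naive period T below are invented.

    # Toy illustration (not the paper's exact metrics): precision/recall of an
    # update schedule over weekly update decisions.
    def schedule_precision_recall(needed, updated):
        true_pos = sum(n and u for n, u in zip(needed, updated))
        precision = true_pos / sum(updated) if any(updated) else 0.0
        recall = true_pos / sum(needed) if any(needed) else 0.0
        return precision, recall

    weeks = 52
    # Hypothetical ground truth: the database needs a refresh every ninth week.
    needed = [w % 9 == 0 for w in range(1, weeks + 1)]
    # Naive policy: refresh the summary every T = 4 weeks, regardless of evidence.
    naive = [w % 4 == 0 for w in range(1, weeks + 1)]

    print(schedule_precision_recall(needed, naive))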

Reviewer:  Apostolos Papadopoulos Review #: CR135043
Textual Databases (H.2.4 ...)
Distributed Databases (H.2.4 ...)
Indexing Methods (H.3.1 ...)
Information Filtering (H.3.3 ...)
Information Networks (H.3.4 ...)
Large Text Archives (H.3.6 ...)