Computing Reviews

Efficient online index construction for text databases
Lester N., Moffat A., Zobel J. ACM Transactions on Database Systems33(3):1-33,2008.Type:Article
Date Reviewed: 02/02/09

In many information retrieval applications, the update of inverted index structures needs to be online, since such indexes should always be current and accessible for query processing. This study introduces an online index construction technique for document-sorted inverted indexes.

First, the paper introduces the inverted index concept. This is followed by a detailed explanation of three online index construction techniques: rebuild, in-place update, and remerge. These methods have been explored by Lester, Moffat, and Zobel in their previous work [1]. In the remerge approach, the entire on-disk index is read, extended by using the added documents, and written to a new disk location. This approach requires processing of the entire index when the in-memory and on-disk parts are merged, and its data transfer cost is “proportional to the total number of postings in the resulting index.” All of these index construction techniques use “a single, contiguous, inverted list per term.” The authors describe a new method that uses the remerge technique, but, this time, “break the index into a tightly controlled number of partitions” by using a technique called geometric partitioning. According to the authors, the partitioning idea is inspired by the “too-many-fragments approach of Tomasic et al.” [2]. By controlling the number of partitions, the merge activities are handled in such a way that the processing cost is kept at a reasonable level. The method uses periodic merging to prevent the excessive growth of number of partitions that may increase query-processing time. The geometric partitioning idea is also used for vocabulary updates by storing vocabulary in multiple B+ tree structures.

The authors analyze their methods and validate them by using the 426 GB gov2 Web-based document collection and a sequence of 10,000 queries provided by Microsoft Search. In the query-processing experiments, they use the index at 49 different points during its construction. They show that their approach scales well, and construction costs are significantly reduced, compared to standard contiguous representation of inverted indexes. Furthermore, controlling and limiting the number of index segments reduces query-processing time increases caused by partitioning.

The paper is well written and requires careful reading. The work is valuable for researchers and students in information-retrieval courses.


1)

Lester, N.; Moffat, A.; Zobel, J. Fast Online Index Construction Via Geometric Partitioning. In Proceedings of the ACM International Conference on Information and Knowledge Management (CIKM’05) ACM, 2005, 776–783.


2)

Tomasic, A.; García-Molina, H.; Shoens, K. Incremental Updates of Inverted Lists for Text Document Retrieval. In Proceedings of the ACM International Conference on Management of Data (SIGMOD’94) ACM, 1994, 289–300.

Reviewer:  F. Can Review #: CR136478 (0910-0962)

Reproduction in whole or in part without permission is prohibited.   Copyright 2024 ComputingReviews.com™
Terms of Use
| Privacy Policy