Computing Reviews, the leading online review service for computing literature.

Effect of relationships between words on Japanese information retrieval
Matsumura A., Takasu A., Adachi J. ACM Transactions on Asian Language Information Processing 5(3): 264-289, 2006. Type: Article

Date Reviewed: May 18 2007

The most common approach to implementing information retrieval (IR) systems is the term frequency-inverse document frequency (tf-idf) model. In this model, a word that occurs frequently in a given document but rarely in the rest of the collection receives relatively higher importance in ranking that document. The model is implemented using an inverted file structure, which provides an efficient retrieval environment, especially with the dynamic pruning techniques introduced for the Web environment [1].

This work introduces two IR methods and compares their retrieval performance with that of the tf-idf model. The first method uses dependency relationships between words in a sentence. The second method uses proximity relationships, mainly ordered co-occurrence information of words in a sentence, to approximate the dependency relationships between words. To represent these relationships, a structured index in the form of a binary tree is constructed for each document. Index creation involves morphological, dependency, and compound noun analyses. The same structure is also constructed for the queries, which are expressed as complete sentences. The document and query index structures are used together in the search process. The authors show that in Japanese IR with full sentence queries, these methods, with properly chosen parameters, are superior to the tf-idf model and can increase IR effectiveness by up to 22 percent.

The presentation in the paper flows nicely. However, the authors’ claim regarding the superior performance of the methods “independently of the target collection and search topic set” is too strong, since the experiments are based only on subsets of the National Academic Center for Science Information Systems (NACSIS) Test Collection for IR systems (NTCIR-1). Additionally, there are a number of issues and concerns to be addressed. My major concern is the use of full sentence queries: most Web queries, for example, consist of only a few words. Hence, the approach is hard to generalize to most real-life situations. Furthermore, a comparison with IR techniques that use query term distance information in a statistical sense is missing. Search efficiency is an important issue that is not addressed, but it is on the authors’ future research agenda.

The NTCIR-1 test collection used in the experiments contains 330,000 documents and 83 queries (topics). The authors indicate that it is the largest Japanese IR test collection. Compared with some English TREC test collections, however, it is relatively small. This observation suggests that there is room for more aggressive IR research in (probably most) non-English languages.
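
For readers unfamiliar with the baseline discussed in this review, the following is a minimal sketch of tf-idf ranking over an inverted file, together with a simple way to extract ordered co-occurrence pairs from a token sequence. It is an illustration only and does not reproduce the authors' method: the function names, the `window` parameter, the unsmoothed `log(N / df)` weighting, and the toy English documents are all assumptions made here.

```python
import math
from collections import Counter, defaultdict


def build_inverted_index(docs):
    """Map each term to {doc_id: term frequency in that document}."""
    index = defaultdict(dict)
    for doc_id, tokens in docs.items():
        for term, tf in Counter(tokens).items():
            index[term][doc_id] = tf
    return index


def tfidf_rank(query_tokens, index, num_docs):
    """Score documents by summing tf * idf over the distinct query terms."""
    scores = defaultdict(float)
    for term in set(query_tokens):
        postings = index.get(term, {})
        if not postings:
            continue  # term does not occur in the collection
        # Assumed weighting: plain log(N / df), with no smoothing.
        idf = math.log(num_docs / len(postings))
        for doc_id, tf in postings.items():
            scores[doc_id] += tf * idf
    # Highest score first; ties broken by document id.
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))


def ordered_pairs(tokens, window=2):
    """Return pairs (a, b) where b follows a within `window` tokens.

    `window` is a hypothetical parameter chosen for this sketch.
    """
    pairs = set()
    for i, a in enumerate(tokens):
        for b in tokens[i + 1 : i + 1 + window]:
            pairs.add((a, b))
    return pairs


# Toy collection (already tokenized; made up for illustration).
docs = {
    "d1": ["japanese", "information", "retrieval", "retrieval"],
    "d2": ["information", "theory"],
    "d3": ["word", "dependency", "analysis"],
}
index = build_inverted_index(docs)
ranking = tfidf_rank(["information", "retrieval"], index, len(docs))
pairs = ordered_pairs(["word", "dependency", "analysis"], window=1)
```

In this toy collection, "retrieval" occurs in only one document, so it contributes more to the score of `d1` than "information", which occurs in two. The ordered pairs preserve word order, so `("word", "dependency")` is produced but `("dependency", "word")` is not.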

Reviewer: F. Can	Review #: CR134297

1)	Zobel, J.; Moffat, A. Inverted files for text search engines. ACM Computing Surveys 38, 2 (2006), Article 6.

Indexing Methods (H.3.1 ... )

Linguistic Processing (H.3.1 ... )

Performance Evaluation (Efficiency And Effectiveness) (H.3.4 ... )

Search Process (H.3.3 ... )

Content Analysis And Indexing (H.3.1 )

Information Search And Retrieval (H.3.3 )

Reproduction in whole or in part without permission is prohibited. Copyright 1999-2024 ThinkLoud®