The most common approach to the implementation of information retrieval (IR) systems is the term frequency-inverse document frequency (tf-idf) model. In this model, a word that occurs frequently in a given document but rarely in the collection as a whole receives a relatively higher importance in document ranking. The model is typically implemented with an inverted file structure, which provides an efficient retrieval environment, especially when combined with the dynamic pruning techniques introduced for the Web environment [1].
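To make the weighting concrete, the following is a minimal sketch of tf-idf ranking over an inverted file. It is not taken from the paper under review: the function names, the whitespace tokenization, the raw-count tf, the logarithmic idf, and the toy documents are all assumptions chosen for illustration, and no pruning is attempted.

```python
import math
from collections import Counter, defaultdict


def build_inverted_index(docs):
    """Map each term to {doc_id: term frequency in that document}."""
    index = defaultdict(dict)
    for doc_id, text in enumerate(docs):
        # Assumption: whitespace tokenization; Japanese text would need
        # morphological analysis instead.
        for term, tf in Counter(text.lower().split()).items():
            index[term][doc_id] = tf
    return index


def rank(query, index, n_docs):
    """Score documents by summing tf * idf over the query terms."""
    scores = defaultdict(float)
    for term in query.lower().split():
        postings = index.get(term, {})
        if not postings:
            continue  # term does not occur in the collection
        # Assumption: idf = log(N / df); terms found in every document get 0.
        idf = math.log(n_docs / len(postings))
        for doc_id, tf in postings.items():
            scores[doc_id] += tf * idf
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


# Hypothetical toy collection.
docs = [
    "retrieval of japanese text",
    "japanese text and japanese queries",
    "english text retrieval",
]
index = build_inverted_index(docs)
ranking = rank("japanese queries", index, len(docs))
```

Here the second document ranks first, because it contains "japanese" twice and is the only one containing the rarer term "queries". A term such as "text", which occurs in every document, contributes nothing to any score under this idf definition.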
This work introduces two IR methods, and compares their retrieval performance with that of the tf-idf model. The first method uses dependency relationships between words in a sentence. The second method uses proximity relationships, mainly ordered co-occurrence information of words in a sentence, to approximate the dependency relationships between words. To represent these relationships, a structured index in the form of a binary tree is constructed for each document. Index creation involves morphological, dependency, and compound noun analyses. The same structure is also constructed for the queries, which are expressed in the form of complete sentences. The document and query index structures are then used together in the search process. The authors show that in Japanese IR, using full sentence queries, these methods, with properly chosen parameters, are superior to the tf-idf model, and can increase IR effectiveness by up to 22 percent.
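The idea of ordered co-occurrence can be illustrated with a small sketch. This is not the authors' binary tree index or their scoring scheme. It only shows, under assumed names and an assumed window size, how ordered word pairs from a sentence query can be compared with those of a document sentence. Whitespace tokenization is also an assumption, since Japanese text would first require morphological analysis.

```python
def ordered_pairs(sentence, window=2):
    """Return the set of ordered pairs (left, right) in which `right`
    follows `left` within `window` tokens of the same sentence."""
    tokens = sentence.lower().split()
    pairs = set()
    for i, left in enumerate(tokens):
        for right in tokens[i + 1 : i + 1 + window]:
            pairs.add((left, right))
    return pairs


def pair_overlap(query, sentence, window=2):
    """Fraction of the query's ordered pairs that also occur in the sentence."""
    query_pairs = ordered_pairs(query, window)
    if not query_pairs:
        return 0.0  # a one-word query has no pairs to match
    return len(query_pairs & ordered_pairs(sentence, window)) / len(query_pairs)


# Hypothetical example: all three query words occur in the sentence,
# but only one ordered pair, (information, retrieval), is preserved.
score = pair_overlap("information retrieval system",
                     "system for information retrieval")
```

A bag-of-words match would treat this sentence as containing every query term, whereas the ordered-pair comparison credits only the pair whose word order is preserved. This is the kind of distinction that proximity information is meant to capture.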
The presentation in the paper flows nicely. However, the authors’ claim regarding the superior performance of the methods “independently of the target collection and search topic set” is too strong, since the experiments are based only on subsets of the National Academic Center for Science Information Systems (NACSIS) Test Collection for IR systems (NTCIR-1). Additionally, there are a number of issues and concerns to be addressed. My major concern is the use of full sentence queries. Most Web queries, for example, consist of only a few words, so the approach is hard to generalize to most real-life situations. Furthermore, a comparison with IR techniques that use query term distance information in a statistical sense is missing. Search efficiency is an important issue that is not addressed, although it is on the authors’ future research agenda.
The NTCIR-1 test collection used in the experiments contains 330,000 documents and 83 queries (topics). The authors indicate that it is the largest Japanese IR test collection. Even so, it is relatively small when compared with some English TREC test collections. This observation suggests that there is room for more aggressive IR research in Japanese and (probably most) other non-English languages.