Computing Reviews
A nonparametric term weighting method for information retrieval based on measuring the divergence from independence
Kocabaş İ., Dinçer B., Karaoğlan B. Information Retrieval 17(2): 153-176, 2014. Type: Article
Date Reviewed: Oct 28 2014

In information retrieval (IR), effective term-weighting schemes are essential. Existing work centers on the divergence from randomness (DFR) framework, which assumes a particular parametric shape for the actual distribution of term frequencies in the collection. To avoid this assumption, the authors propose a nonparametric approach called divergence from independence (DFI), built on the observation that a specialty word occurs in its semantically related content with a frequency that diverges from that of a non-specialty word.
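To make the independence baseline concrete, here is a minimal Python sketch (the names and toy data are illustrative, not from the paper) of the frequency a term would be expected to have in a document if term occurrences were independent of documents; a DFI weight measures how far the observed frequency diverges from this expectation.

```python
from collections import Counter

def expected_under_independence(term, doc_len, collection_tf, collection_len):
    """Expected frequency of `term` in a document of length `doc_len`,
    assuming term occurrences are independent of documents: the term's
    share of all collection tokens, scaled by the document's length."""
    return collection_tf[term] * doc_len / collection_len

# Toy collection: two "documents" as token lists.
docs = [["retrieval", "model", "term", "term"], ["model", "web", "page"]]
collection_tf = Counter(tok for doc in docs for tok in doc)
collection_len = sum(len(doc) for doc in docs)

# "term" occurs twice in docs[0] but is expected only 2 * 4 / 7 ~= 1.14
# times there, so it diverges from independence and would be up-weighted.
print(expected_under_independence("term", len(docs[0]), collection_tf, collection_len))
```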

Based on DFI, the authors introduce three models: the saturated model, standardization, and normalized chi-squared distance. They then compare the retrieval accuracy obtained by these models, their variants, and five DFR-based approaches. The experimental results indicate that the DFI model based on standardization with inverse document frequency (IDF) works well for well-structured documents (such as newspapers and congressional records), while the DFI model based on normalized chi-squared distance works well for non-controlled documents (such as web pages).
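As a rough illustration only, the sketch below gives chi-square-style forms commonly associated with these labels (a saturated observed-to-expected ratio, a standardized residual, and a chi-squared distance); these particular formulas and the log damping are assumptions made for exposition and may differ in detail from the paper's exact definitions.

```python
import math

def dfi_weight(observed, expected, model="standardization"):
    """Illustrative DFI-style weight; the paper's exact definitions may
    differ. Terms occurring no more often than independence predicts
    receive zero weight."""
    if observed <= expected:
        return 0.0
    if model == "saturated":          # ratio of observed to expected counts
        d = observed / expected
    elif model == "standardization":  # standardized (chi-square) residual
        d = (observed - expected) / math.sqrt(expected)
    elif model == "chi_squared":      # squared residual normalized by expected
        d = (observed - expected) ** 2 / expected
    else:
        raise ValueError(f"unknown model: {model}")
    return math.log2(1.0 + d)         # log damping, common in tf-style weights
```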

Although the experiments reported in the paper are interesting, additional experiments would give a fuller picture of the quality of the proposed models. For example, other work has reported that log entropy is effective for scholarly papers [1], which are a type of well-structured document; the authors should therefore have compared their approach with log entropy in that setting. In addition, since IR systems should rank relevant documents highly, the authors should have evaluated their approach by how many relevant documents appear in the top 1,000 results (precision at 1,000) rather than by the total number of relevant documents retrieved. Furthermore, social media content, which largely consists of short texts, has recently emerged, and some works address it by improving classical term-weighting methods [2,3]. It would have been nice if the authors had extended their experiments to cover such content.
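For reference, precision at 1,000 is simply the fraction of the top 1,000 ranked results that are relevant; a minimal sketch (with illustrative names) follows.

```python
def precision_at_k(ranked_doc_ids, relevant_doc_ids, k=1000):
    """Fraction of the top-k ranked documents that are relevant."""
    top_k = ranked_doc_ids[:k]
    return sum(1 for doc_id in top_k if doc_id in relevant_doc_ids) / k

# Example: three of the first five retrieved documents are relevant.
print(precision_at_k(["d1", "d7", "d3", "d9", "d4"], {"d1", "d3", "d4"}, k=5))  # 0.6
```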

This paper will be helpful for IR researchers, especially those developing IR models.

Reviewer: Kazunari Sugiyama. Review #: CR142865 (1501-0083)
1) Dumais, S. T. Improving the retrieval of information from external sources. Behavior Research Methods, Instruments, & Computers 23, 2 (1991), 229–236.
2) Phan, X.-H.; Nguyen, L.-M.; Horiguchi, S. Learning to classify short and sparse text & web with hidden topics from large-scale data collections. In Proc. of the 17th International Conference on World Wide Web (WWW '08). ACM, (2008), 91–100.
3) Naveed, N.; Gottron, T.; Kunegis, J.; Alhadi, A. C. Searching microblogs: coping with sparsity and document quality. In Proc. of the 20th ACM International Conference on Information and Knowledge Management (CIKM '11). ACM, (2011), 183–188.
Categories: Information Search And Retrieval (H.3.3); Online Information Services (H.3.5); Systems And Information Theory (H.1.1)
Other reviews under "Information Search And Retrieval":

Nested transactions in a combined IRS-DBMS architecture
Schek H. (ed) Research and development in information retrieval, King's College, Cambridge, 1984. Type: Proceedings
Date reviewed: Nov 1 1985

An integrated fact/document information system for office automation
Ozkarahan E., Can F. (ed) Information Technology Research Development Applications 3(3): 142-156, 1984. Type: Article
Date reviewed: Oct 1 1985

Access methods for text
Faloutsos C. ACM Computing Surveys 17(1): 49-74, 1985. Type: Article
Date reviewed: Jan 1 1986
