Computing Reviews
On a combination of probabilistic and Boolean IR models for WWW document retrieval
Yoshioka M., Haraguchi M. ACM Transactions on Asian Language Information Processing 4(3): 340-356, 2005. Type: Article
Date Reviewed: Jun 5 2006

The authors presume that readers will have substantial knowledge of the details of query reformulation in the context of information retrieval (IR), including background material, acronyms, and a significant familiarity with some of the mathematical formulae in common use in this specialized subfield. In addition, the paper is at once dense and yet sparse on some key details. In preparing this review, I found the works of Crouch et al. [1] and Azzopardi [2] very helpful in overcoming my own deficiencies in some of these respects. Thus, this paper can be recommended only to subspecialists.

The paper is based on the concept of query reformulation, which is defined in Crouch et al. [1] as:

a technique ... successfully used to enhance ... retrieval effectiveness of queries. [Using] [r]elevance feedback ... the query is automatically reformulated based on information contained in the original query and in retrieved documents judged relevant (and non-relevant) by the user. A similar approach is pseudo- (or pseudo-relevance) feedback, wherein a certain set of documents is assumed relevant ... and ... is then used to reformulate the query (pages 1-2).

“[This] requires no interaction with the user” and it turns out to work, however surprising that may be.

This paper’s contribution is a new method based on the general aims quoted above from Crouch et al. [1]. Methods are judged effective if they have high precision, “the proportion of retrieved information that is actually relevant” [2], and high recall, “the fraction of relevant documents that have been retrieved” [2]. A bit of thought will convince the reader that these two aims are naturally antagonistic in the absence of very long, specific queries.
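The tension between the two measures can be seen in a small worked example. The sketch below is illustrative only: the document identifiers and the two result sets are invented for the purpose, and the function name `precision_recall` is an assumption, not something taken from the paper or from [2].

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for one query.

    precision = relevant retrieved / all retrieved
    recall    = relevant retrieved / all relevant
    """
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical collection: four documents are truly relevant.
relevant = {"d1", "d2", "d3", "d4"}

# A narrow result set returns few documents, all of them relevant.
narrow = ["d1", "d2"]
# A broad result set finds everything relevant, plus a lot of noise.
broad = ["d1", "d2", "d3", "d4", "d5", "d6", "d7", "d8"]

print(precision_recall(narrow, relevant))  # (1.0, 0.5)
print(precision_recall(broad, relevant))   # (0.5, 1.0)
```

Narrowing the result set raises precision at the cost of recall, and broadening it does the reverse, which is the antagonism described above.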

The assumption behind all these methods is, as put by Azzopardi [2], that the user’s mind is a “noisy channel,” meaning that the user is not really sure what he is looking for, although he recognizes it when he sees it. Moreover, different users have different ideal documents in mind, yet they express their needs with identical queries. The combination of these factors is what has led the IR community to accept pseudo-relevance as a valid concept--that, and the fact that it works.

The new method takes the five most relevant documents that contain all the terms or phrases from the original query, and finds additional terms (but not phrases) that frequently co-occur in those documents in order to generate new pseudo-relevant documents. Here, "terms" means nouns, and only words that are syntactically nouns, thereby excluding verbal uses that serve the semantic purpose of nouns. Probabilistic analysis is then applied so that the final result set may include some documents that lack one or more of the original terms or phrases but contain, instead, what are presumed to be complementary (synonymous) terms.
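The first stage of this description can be sketched in a few lines. This is a simplified illustration of generic pseudo-relevance feedback with a Boolean AND filter, not the authors' implementation: it assumes the documents arrive already ranked by some initial retrieval score and already reduced to lists of nouns, it scores candidate terms by plain document frequency, and it omits the probabilistic stage entirely. The function name `expand_query`, its parameters, and the sample documents are all hypothetical.

```python
from collections import Counter


def expand_query(query_terms, ranked_docs, top_k=5, n_new=2):
    """Suggest expansion terms from pseudo-relevant documents.

    ranked_docs: lists of terms, assumed sorted best-first by an
    initial retrieval score (the scoring itself is not modeled here).
    """
    query = set(query_terms)
    # Boolean AND filter: keep documents containing every query term.
    matching = [set(doc) for doc in ranked_docs if query <= set(doc)]
    # Treat the top_k survivors as pseudo-relevant, with no user input.
    pseudo_relevant = matching[:top_k]
    # Count how many pseudo-relevant documents contain each new term.
    doc_freq = Counter(t for doc in pseudo_relevant for t in doc - query)
    # Most frequent first; ties broken alphabetically for determinism.
    ranked_terms = sorted(doc_freq, key=lambda t: (-doc_freq[t], t))
    return ranked_terms[:n_new]


# Hypothetical, pre-ranked collection of noun lists.
docs = [
    ["search", "engine", "web", "ranking"],
    ["search", "engine", "web", "crawler"],
    ["search", "engine", "ranking", "web"],
    ["engine", "car", "fuel"],  # lacks "search", so it is filtered out
    ["search", "engine", "index"],
]

print(expand_query(["search", "engine"], docs))  # ['web', 'ranking']
```

In a fuller system, the suggested terms would be used to relax the strict Boolean condition, which is the role the review attributes to the probabilistic analysis.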

The authors have tested their method extensively and are convinced it works. Although I am unable to independently assess this claim (partly because the language used for terms was Japanese and partly because of the density and sparseness of the presentation), I certainly do believe it.

Yet, I would hope that this is not the wave of the future. Search engines, for all their prominence and ubiquity, are still a new tool, and are not properly used. Among the key problems are that users rely on the first few results too often [1,2] and do not formulate queries thoughtfully. Time will improve these behaviors, even as time has tempered the urge to send out email too quickly or without running a spelling check. In other words, it would be better for the user’s mind to be less of a noisy channel than to second-guess it by automatic means, which will only weaken critical thinking. Moreover, today’s search engines are extremely primitive; their extraordinary usefulness comes almost exclusively from their vast size. Various proximity operators, sensitivity to special symbols, and a variety of other searching enhancements available in commercial online databases could easily be added to search engines, and would render research of this type far less necessary.

Reviewer:  Joseph S. Fulda Review #: CR132877 (0704-0397)
1) Crouch, C.J.; Crouch, D.B.; Chen, Q.; Holtz, S.J. Improving the retrieval effectiveness of very short queries. Information Processing and Management 38 (2002), 1–36.
2) Azzopardi, L. Incorporating context within the language modeling approach for ad hoc information retrieval, PhD dissertation. University of Paisley, 2005, http://www.cis.strath.ac.uk/~leif/downloads/azzopardi2005thesis_final.pdf.
Query Formulation (H.3.3 ... )
Relevance Feedback (H.3.3 ... )
Information Search And Retrieval (H.3.3 )
 
