Computing Reviews
Language and representation in information retrieval
Blair D., Elsevier North-Holland, Inc., New York, NY, 1990. Type: Book (9780444884374)
Date Reviewed: Nov 1 1990

I am torn between admiration and distress. One half of this volume (chapters 4 and 6) treats aspects of the philosophy of language and the philosophy of science, respectively. These topics are only remotely connected with information retrieval as it is normally defined, but the text offers many insightful comments and interesting ideas in these areas. Being largely unfamiliar with language theories, I was impressed and challenged by the author’s knowledge and erudition.

The same, unhappily, cannot be said about the other half of the book (chapters 1–3 and 5), which covers certain aspects of text analysis and retrieval more directly. In these areas, the author follows his own agenda and the treatment is fragmentary and inadequate.

Blair uses the following main line of argument: The known research output in text analysis and information retrieval is based on work performed with small sample document collections; specifically, the evaluation studies demonstrating the usefulness of modern retrieval techniques are applicable only to such small, nonrepresentative sample collections. Small laboratory collections and large operational ones existing in the real world are very different not only quantitatively but also qualitatively. Since the activities of small-scale and large-scale retrieval systems are so different, the text representation languages used in these activities must also be different. In particular, it is unthinkable that keyword systems (where document content is represented by sets of manually or automatically assigned keywords) would operate satisfactorily with large collections. The content representation of texts is crucial, and before any progress can be made in retrieval, a workable theory of word and text meaning needs to be developed.

In considering the problems of text analysis and representation of text content, it is easy to agree that a complete and usable theory of meaning would be very nice to have. Accurate meaning representations could then be attached to stored texts, and retrieval performance would be much enhanced. Unhappily, the prospects in this area are not good, and for the foreseeable future we must learn to get along without a practical semantic theory. Fortunately, the existing large-scale retrieval systems have done quite well even without an acceptable semantic theory: hundreds of thousands of text searches are run every month around the world, and many users, including whole classes of professional people, have come to depend on automated retrieval systems as a part of their routine daily endeavors.

The author postulates that these users cannot possibly be receiving satisfactory service. Many of them do, however: not only have many system users expressed their satisfaction with the existing retrieval services, but various studies have been carried out assessing the effectiveness of large-scale retrieval systems, and objective performance measurements indicate that a reasonable degree of retrieval effectiveness is possible. For example, Cleverdon obtained a high standard of performance with a NASA database of 44,000 documents (recall of 0.78 and precision of 0.63) [1, 2]. The author fails to mention any of the extensive literature that could disprove his theories, preferring instead to cite only some early tests performed in the mid-1960s, which he rejects based on conjectures about possibly flawed collection compositions and possibly biased user populations [3, 4].
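
For readers who do not use these measures every day: recall is the fraction of all relevant documents that a search retrieves, and precision is the fraction of the retrieved documents that are relevant. The short Python sketch below computes both from sets of document identifiers; the counts in the example are invented merely to land near the figures quoted above and have nothing to do with Cleverdon's actual data.

```python
def precision_recall(retrieved, relevant):
    """Precision and recall for a single search.

    retrieved -- set of document ids returned by the system
    relevant  -- set of document ids judged relevant for the query
    """
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

# Invented counts: 100 documents retrieved, 80 relevant overall,
# 63 of the retrieved documents relevant -> precision 0.63, recall ~0.79.
retrieved = set(range(100))
relevant = set(range(63)) | set(range(1000, 1017))
print(precision_recall(retrieved, relevant))
```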

Assuming now that existing retrieval systems are in fact inadequate, either in theory or in practice, one wants to know what can be done to improve them. The author’s answer is not much, short of the introduction of new representation languages for text meaning that are not likely to be available in the foreseeable future. Many possible theories and procedures that are easily implemented and could substantially enhance the performance of existing systems go unmentioned in this text, including the theory of keyword weighting, which leads to the assignment of weighted keywords to identify document content instead of the currently used unweighted ones, and the generation of ranked output, where the retrieved items are presented to the user in decreasing order of presumed relevance to the search queries [5].
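
To make the two ideas concrete, the Python sketch below assigns one common form of weighted keywords (tf-idf weights) and ranks documents by a simple inner-product score. It is offered only as an illustration of weighted keywords and ranked output in general, not as the particular weighting formulas evaluated in [5].

```python
import math
from collections import Counter

def tf_idf_weights(docs):
    """Assign a tf-idf weight to every keyword of every document."""
    n = len(docs)
    df = Counter(term for doc in docs for term in set(doc))
    return [{term: tf * math.log(n / df[term])
             for term, tf in Counter(doc).items()}
            for doc in docs]

def ranked_search(query_terms, doc_weights):
    """Return document indices in decreasing order of a simple
    inner-product score between the query and the weighted documents."""
    scores = [(sum(weights.get(t, 0.0) for t in query_terms), i)
              for i, weights in enumerate(doc_weights)]
    return [i for score, i in sorted(scores, reverse=True) if score > 0]

docs = [["information", "retrieval", "keyword", "weighting"],
        ["language", "philosophy", "meaning"],
        ["retrieval", "ranked", "output", "retrieval"]]
print(ranked_search(["retrieval", "weighting"], tf_idf_weights(docs)))  # [0, 2]
```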

When the author discusses a particular enhanced retrieval method, such as the well-known relevance feedback process (in which improved query formulations are automatically constructed based on relevance assessments provided by the user for certain previously retrieved items), the treatment is so misleading that it would have been better omitted. The author has this to say about relevance feedback:

Several difficulties have prevented these (relevance feedback) techniques from being incorporated into the design of commercial large-scale retrieval systems. In the first place, the study of these techniques was conducted on exceptionally small databases of 200 documents.… There have been no reported tests of these techniques on large systems. But there are theoretical reasons why we could expect that their performance would not improve retrieval on systems with larger databases (pp. 250–251).

Actually, relevance feedback represents one of the few easily used techniques that can vastly increase the performance of retrieval systems. Relevance feedback techniques have been incorporated into dozens of modern retrieval environments, including the system designed to process the Reuters newswire messages using Connection Machine searches [6]. Furthermore, the available evaluation results obtained with several document collections of reasonable size demonstrate that relevance feedback provides impressive performance improvements in a variety of collection environments [7].
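
The sketch below illustrates the general mechanism with a Rocchio-style query modification, one standard formulation of relevance feedback; the coefficient values are arbitrary, and the example is not meant to reproduce the specific procedures tested in [7].

```python
def rocchio_feedback(query, relevant_docs, nonrelevant_docs,
                     alpha=1.0, beta=0.75, gamma=0.15):
    """Rocchio-style query modification: shift the query term weights
    toward the relevant documents and away from the nonrelevant ones.
    The coefficient values here are arbitrary illustration choices."""
    terms = set(query)
    for doc in relevant_docs + nonrelevant_docs:
        terms |= set(doc)

    def centroid(docs, term):
        return sum(d.get(term, 0.0) for d in docs) / len(docs) if docs else 0.0

    return {t: max(0.0, alpha * query.get(t, 0.0)
                        + beta * centroid(relevant_docs, t)
                        - gamma * centroid(nonrelevant_docs, t))
            for t in terms}

# The user marks one retrieved item relevant and one nonrelevant; the
# reformulated query picks up "feedback" and drops "philosophy" to zero.
query = {"retrieval": 1.0}
relevant = [{"retrieval": 0.5, "feedback": 0.8}]
nonrelevant = [{"philosophy": 0.9}]
print(rocchio_feedback(query, relevant, nonrelevant))
```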

The treatment of relevance feedback in this book is typical: instead of citing modern work that contradicts or challenges his notions about the inherent unworkability of modern retrieval techniques, the author gives a few early results, normally obtained with tiny test collections in the 1960s, and then cites the “lack of upscaling” argument to “demonstrate” that these results are not generally valid. Since this sterile exercise is repeated many times, I hope this book will not fall into the hands of novices and other readers who may be unable to supply the missing context. Such people would certainly be misled by this somewhat fanciful description of current conditions and future prospects in the field.

Fortunately, it is easier to talk about the much more edifying other half of the book, especially chapter 4, which covers the philosophy of language. In this chapter, which accounts for nearly 40 percent of the text, the author traces major developments in semiotics and introduces a number of semantic language models. He rejects “mentalistic, behavioristic and representational” theories of language in favor of an “implementational” view of language due largely to the influential writings of Ludwig Wittgenstein.

In Wittgenstein’s view words are like tools: they can be used for many purposes, and the appearance of a word in a text indicates nothing definite by itself. Because most words have many functions, a word’s meaning cannot be ascertained readily by looking in a dictionary or thesaurus. Instead, just as the function of a tool is determined by the particular task being carried out, the meaning of a word is determined by its use. To quote Wittgenstein,

For a large class of cases--though not for all--in which we use the word “meaning,” it can be defined thus: the meaning of a word is its use in the language [8].

Accordingly, Blair says that we need contextual descriptions for the linguistic components, where context means not only the linguistic environment in which the words appear but also the document context, including the function, purpose, and general environment of the text:

In information retrieval systems, two areas of development need to be pursued: the contextual dimensions of document representations need to be expanded, and subject descriptions need to be related to these documents or categories (p. 183).

The problem then consists of capturing the contextual dimension needed for proper content representation. The author refers to John Searle, who distinguishes the “brute facts” of the language from the “institutional facts” [9]. Brute facts are computable facts, such as statistical and structural data that are directly obtainable from the available texts. Institutional facts, on the other hand, are facts that require human institutions and background for interpretation. The claim is made that word meaning depends critically on the institutional background, and that this background cannot be derived from even the most complete and carefully constructed set of brute facts. As the author says, “the goal of any document indexing strategy should be to build as much of the missing…institutional context into the language representation” (p. 323).

And so it seems we are stuck, because we have no good way of supplying the complete institutional context in which a text is placed, and the brute language facts that we can actually generate are apparently insufficient for a proper content description. The author gives up at this point, but the reader has to wonder whether compromises are not possible after all. For example, instead of supplying complete institutional information as part of the document indexing, it might be enough to make sure that the terms chosen for the content representation are free of ambiguity in their particular text environments. In such a case, one would know that the term “base” in environment A refers to “army base” whereas in environment B it represents “lamp base” or “baseball base” as the case may be. For this simpler type of term disambiguation, brute linguistic know-how might suffice.
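
A toy sketch of this simpler kind of disambiguation follows; the sense inventory and clue words are hand-invented stand-ins for the statistical "brute facts" that a real system would derive from large text samples.

```python
# Hand-invented sense inventory and clue words; a real system would derive
# such statistics from large text samples rather than from a hand-built list.
SENSE_CLUES = {
    "base": {
        "army base":     {"troops", "military", "commander"},
        "lamp base":     {"lamp", "shade", "bulb"},
        "baseball base": {"inning", "pitcher", "runner"},
    }
}

def disambiguate(term, context_words):
    """Pick the sense whose clue words overlap most with the surrounding
    words; return the term unchanged if no clue word appears at all."""
    senses = SENSE_CLUES.get(term)
    if not senses:
        return term
    context = set(context_words)
    best = max(senses, key=lambda sense: len(senses[sense] & context))
    return best if senses[best] & context else term

print(disambiguate("base", ["the", "runner", "slid", "into", "second"]))
# -> "baseball base"
```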

In fact, the strict separation between brute and institutional facts proposed by Searle and Wittgenstein may not be so compelling after all. Brute (statistical) data derived from large text samples have been used in the past for noun phrase disambiguation in linguistic analysis systems [10,11], and even machine translation systems have been based on brute language facts [12]. The author does not pursue this line of thought, evidently preferring to maintain that something essential is necessarily missing in any text analysis based on brute facts alone. Be that as it may, the author’s treatment certainly leads to speculation about these and other avenues for advancing text understanding.

I highly recommend the philosophical half of this book (chapters 4 and 6) to information retrieval experts and others who like the challenge of a new point of view. One must be grateful to the author for helping to broaden the information retrieval context in this way.

It is therefore especially sad to report that the publisher has totally failed to do justice to this book. The same ugly typeface is used throughout, with no italics, no special set-up for equations, vastly oversized margins, and generally poor page design. The price of nearly $90 for a 330-page book is an insult. Why do publishers bother to put out a piece of work when they are unwilling or unable to provide the minimum of required care and competence in its production?

Reviewer: Gerard Salton
Review #: CR123399
1) Cleverdon, C. W. A computer evaluation of searching by controlled language and natural language in an experimental NASA database. Report ESA#1/432, European Space Agency, Frascati, Italy, July 1977.
2) Salton, G. Another look at automatic text-retrieval systems. Commun. ACM 29, 7 (July 1986), 648–656.
3) Swanson, D. R. Information retrieval as a trial-and-error process. Libr. Q. 47, 2 (April 1977), 128–148.
4) Swanson, D. R. Some unexplained aspects of the Cranfield tests of indexing performance factors. Libr. Q. 41, 3 (July 1971), 223–228.
5) Salton, G. and Buckley, C. Term weighting approaches in automatic text retrieval. Inf. Proc. Manage. 24, 5 (1988), 513–523.
6) Stanfill, C. and Kahle, B. Parallel free-text search on the Connection Machine system. Commun. ACM 29, 12 (Dec. 1986), 1229–1239.
7) Salton, G. and Buckley, C. Improving retrieval performance by relevance feedback. J. ASIS 41, 4 (1990), 288–297.
8) Wittgenstein, L. Philosophical investigations. Basil Blackwell and Mott, Oxford, UK, 1953.
9) Searle, J. Speech acts: an essay on the philosophy of language. Cambridge University Press, 1969.
10) Garside, R.; Leech, G.; and Sampson, G. (Eds.) The computational analysis of English--a corpus-based approach. Longman, London, 1987.
11) Church, K. A stochastic parts program and noun phrase parser for unrestricted text. In Proceedings of the Second Conference on Applied Natural Language Processing (Austin, TX, 1988). Association for Computational Linguistics, Morristown, NJ.
12) Brown, P.; Cocke, J.; Della Pietra, S.; Della Pietra, V. J.; Jelinek, F.; Lafferty, J. D.; Mercer, R. L.; and Roossin, P. S. A statistical approach to machine translation. Comput. Linguist. 16, 2 (June 1990), 79–85.
Categories: Search Process (H.3.3), Linguistic Processing (H.3.1), Natural Language Processing (I.2.7)