Each database in a document retrieval system usually has an associated thesaurus of terms used to index and retrieve the documents. Thesauri are generally hierarchical; that is, there exists a “tree” of several levels of increasingly specific terms.
An important current problem in information retrieval is to fashion retrieval techniques that will be useful in a network of heterogeneous databases, where a different thesaurus may be associated with each database. In this situation, any two thesauri will share some, but not all, of their terms and hierarchical relations.
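The partial overlap between two thesauri can be illustrated with a minimal sketch. The two toy thesauri, their terms, and the helper names below are hypothetical and are not taken from the paper under review; each thesaurus is assumed to be a simple mapping from a narrower term to its single broader term.

```python
# Hypothetical toy thesauri: each maps a narrower term to its broader term.
thesaurus_a = {
    "databases": "computing",
    "retrieval": "computing",
    "indexing": "retrieval",
}
thesaurus_b = {
    "retrieval": "computing",
    "indexing": "retrieval",
    "thesauri": "indexing",
}


def terms(thesaurus):
    """Return every term that appears as a narrower or a broader term."""
    return set(thesaurus) | set(thesaurus.values())


def relations(thesaurus):
    """Return the hierarchical relations as (narrower, broader) pairs."""
    return set(thesaurus.items())


# Terms and relations common to both thesauri, and terms unique to each.
shared_terms = terms(thesaurus_a) & terms(thesaurus_b)
only_in_a = terms(thesaurus_a) - terms(thesaurus_b)
only_in_b = terms(thesaurus_b) - terms(thesaurus_a)
shared_relations = relations(thesaurus_a) & relations(thesaurus_b)

print(sorted(shared_terms))
print(sorted(only_in_a), sorted(only_in_b))
print(sorted(shared_relations))
```

A query phrased with a term found in only one thesaurus has no direct counterpart in the other, which is the kind of mismatch a distributed retrieval technique would have to handle.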
Mazur’s contribution is to develop a mathematical model of this situation, with formalized definitions of such entities as an individual (local) retrieval system and its thesaurus, document collection, and query and retrieval sets. Furthermore, a distributed system made up of a number of local systems is formalized. In some simple situations, certain mathematical properties of the relationship between the local and distributed systems are derived.
The author has made a nice start in formalizing this situation. Unfortunately, four major difficulties need to be overcome before this kind of work can be truly useful. First, the mathematical descriptions and relationships have to be clearly associated with known and understandable features of retrieval systems. This is a question of good, interpretive exposition, and should be doable.
The second difficulty is that modern retrieval systems are quite complicated. Besides indexing and searching by “controlled-vocabulary” terms from a thesaurus, there is free-vocabulary indexing (any word from titles and abstracts), and there is searching by masking, truncation, proximity, field specification, and weighting operations. The third difficulty is that even though superficially the same terms are used in indexing and searching, they may have different meanings in different contexts.
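To make two of the search operations named under the second difficulty concrete, the following is a minimal sketch of right-truncation and proximity matching over free-vocabulary words. The function names, the `*` truncation symbol, and the sample title are assumptions chosen for illustration; real retrieval systems define these operators in their own, differing ways.

```python
def truncation_match(pattern, word):
    """Right-truncation: 'librar*' matches any word starting with 'librar'."""
    if pattern.endswith("*"):
        return word.startswith(pattern[:-1])
    return word == pattern


def proximity_match(tokens, first, second, max_gap):
    """True if `first` and `second` occur within `max_gap` word positions
    of each other, in either order."""
    pos_first = [i for i, t in enumerate(tokens) if t == first]
    pos_second = [i for i, t in enumerate(tokens) if t == second]
    return any(abs(i - j) <= max_gap for i in pos_first for j in pos_second)


# Hypothetical document title, indexed word by word (free vocabulary).
title = "distributed retrieval of library documents".split()

# Truncation search: does any word in the title begin with 'librar'?
has_librar = any(truncation_match("librar*", t) for t in title)

# Proximity search: 'retrieval' and 'documents' are three positions apart.
near_3 = proximity_match(title, "retrieval", "documents", 3)
near_2 = proximity_match(title, "retrieval", "documents", 2)

print(has_librar, near_3, near_2)
```

Neither operation refers to a thesaurus at all, which suggests why a model built only on controlled-vocabulary terms would cover just part of what such systems do.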
The fourth difficulty is that the utility (and, therefore, cost effectiveness) of any system is bound up in retrieving relevant and useful documents. Both relevance and usefulness, like meaning, are highly subjective and not easily captured by formalized mathematical constructs. These last three difficulties pose formidable problems for the formalization and modeling of document retrieval systems in general, and of distributed systems in particular.