The main idea of this book, based on the author’s PhD thesis, is to use markup information as a series of cues to the significance of words and concepts in a text, thus enhancing the indexing of that text. The technique is developed for collections of texts with a specific focus, such as a Web site or a collection of documents maintained in an organization.
The first part of the process is the automatic construction of domain models. Concepts can be extracted automatically and straightforwardly, based on cues such as terms appearing in multiple markup contexts (or even multiple fonts) in a document. These concepts are then organized into a model for the domain from which the texts are drawn. Specifically, they are built into hierarchies.
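To make the extraction cue concrete, here is a minimal sketch of the "multiple markup contexts" idea, assuming HTML input and Python's standard `html.parser`. It is not the author's implementation: the tag set `CONTEXT_TAGS`, the threshold `min_contexts`, and the names `ContextCollector` and `extract_concepts` are all hypothetical, and single words stand in for whatever term units the book actually uses.

```python
from collections import defaultdict
from html.parser import HTMLParser
import re

# Assumed set of tags treated as markup contexts; the book's own list may differ.
CONTEXT_TAGS = {"title", "h1", "h2", "h3", "b", "strong", "em", "i", "a"}


class ContextCollector(HTMLParser):
    """Record, for each word, the set of markup contexts it appears in."""

    def __init__(self):
        super().__init__()
        self.open_tags = []               # context tags currently open
        self.contexts = defaultdict(set)  # word -> set of tag names

    def handle_starttag(self, tag, attrs):
        if tag in CONTEXT_TAGS:
            self.open_tags.append(tag)

    def handle_endtag(self, tag):
        if tag in self.open_tags:
            # Close the most recently opened tag with this name.
            idx = len(self.open_tags) - 1 - self.open_tags[::-1].index(tag)
            del self.open_tags[idx]

    def handle_data(self, data):
        for word in re.findall(r"[a-z]+", data.lower()):
            for tag in self.open_tags:
                self.contexts[word].add(tag)


def extract_concepts(html, min_contexts=2):
    """Return words that occur in at least `min_contexts` distinct contexts."""
    collector = ContextCollector()
    collector.feed(html)
    collector.close()
    return sorted(
        word
        for word, tags in collector.contexts.items()
        if len(tags) >= min_contexts
    )


page = (
    "<h1>Linguistics Courses</h1>"
    '<p>See the <a href="/c">courses</a> page for <b>linguistics</b> staff.</p>'
)
print(extract_concepts(page))
```

In this toy page, "courses" appears in a heading and a link, and "linguistics" in a heading and bold text, so both pass the threshold, while words that occur only in plain paragraph text are ignored.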
The second part of the process uses this model to refine user searches interactively. The hierarchies shape a dialogue with the user conducting a search: when a search is completed, the dialogue manager can offer options such as narrowing the search to more specific concepts found in the collection of texts, or broadening it to a more general, inclusive concept. The dialogue manager idea is general, so the software can be structured as a user interface to an existing search engine (such as Google), or can be used as the entry point for a custom engine. Two applications are described in this part. The first involves indexing Web pages, and is given in two variations: a university Web site and a BBC news Web site. The second application involves searching a directory of classified advertisements.
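The refinement step can be illustrated with a small sketch, assuming the domain model is available as a mapping from each concept to its more specific concepts. The hierarchy contents and the names `HIERARCHY`, `parent_of`, and `suggest_options` are hypothetical, and the book's dialogue manager is certainly richer than this; the sketch only shows how a hierarchy yields "narrow" and "broaden" options for a query.

```python
# Hypothetical concept hierarchy: each concept maps to its more specific concepts.
HIERARCHY = {
    "courses": ["undergraduate courses", "postgraduate courses"],
    "postgraduate courses": ["msc courses", "phd courses"],
}


def parent_of(concept, hierarchy):
    """Return the broader concept that lists `concept` as a child, if any."""
    for parent, children in hierarchy.items():
        if concept in children:
            return parent
    return None


def suggest_options(query, hierarchy):
    """Offer narrower and broader concepts for a completed search."""
    concept = query.strip().lower()
    return {
        "refine": list(hierarchy.get(concept, [])),
        "generalize": parent_of(concept, hierarchy),
    }


options = suggest_options("Postgraduate Courses", HIERARCHY)
print(options["refine"])      # narrower concepts to offer the user
print(options["generalize"])  # broader concept to offer the user
```

Because the function only consumes a query string and returns suggestions, it could sit in front of an existing search engine or a custom one, which is the separation the review describes.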
The approach is attractive because it can be adapted to different contexts in a straightforward manner and is simple both to explain and to implement.