One of the most rapidly growing sources of data, natural-language text, is also one of the most difficult to analyze. Computerized understanding of natural language was among the earliest anticipated benefits of artificial intelligence (AI), but it has proven extraordinarily challenging. This volume offers a selective introduction to the state of the art of computerized analysis of text. As befits the subtitle, “a practical introduction ...,” it situates the techniques it explains in the context of a systems view that emphasizes how natural-language processing (NLP) can be applied in real applications.
Chapter 1 introduces the overall framework, distinguishing analysis of the text from various organizational processes (including search, filtering, categorization, summarization, topic analysis, information extraction, clustering, and visualization) that support the two main objectives of retrieval operations and data mining. With the exception of information extraction and visualization, the book discusses each of these operations.
Chapter 2 provides an overview of mathematical background in probability and statistics, information theory, and machine learning. Chapter 3 reviews the history of NLP and text data understanding. Most of the book is limited to a bag-of-words model, though this chapter acknowledges more sophisticated techniques.
Chapter 4 introduces the authors’ modern text analysis (MeTA) toolkit for text data management and analysis, encouraging readers to download the open-source C++-based system and use it in examples and exercises promised later in the text. This promise of a hands-on learning experience is only partly fulfilled. Few exercises, and even fewer examples in the body of the text, actually say anything about MeTA. Most of the exercises that do mention it do not use it to illustrate a particular text-analytic function, but ask the reader either to inspect how MeTA implements a given function or to extend MeTA to do something discussed in the text. Both kinds of task require the reader to delve into the source code of MeTA rather than use the functionality of the package, and thus assume a level of knowledge about MeTA well beyond anything in the text. These exercises might be useful in the context of a class where the instructor is already acquainted with the internal design and implementation of MeTA. Some other toolkits are mentioned, but important ones, such as MALLET from the University of Massachusetts at Amherst, are not.
After these four introductory chapters, the rest of the book has three parts: seven chapters devoted to accessing textual data, eight to analyzing it, and one final chapter fleshing out an overall architecture for unified text management and analysis.
The chapters on accessing data discuss retrieval models, how the information retrieval system gets feedback from the user, implementation and evaluation of search engines, a special chapter on web-based search, and recommender systems. Most chapters are about 20 pages long (the median chapter length for the book is 18 pages), but the chapter on retrieval models is 46 pages long. The extra detail is useful, given the importance of this theme, but it is uneven compared with the rest of the book. The rationale for the selection of retrieval methods is not clear. Early in the chapter, the authors identify “four major models that are generally regarded as state of the art: pivoted length normalization, Okapi BM25, query likelihood, and PL2.” However, the rest of the chapter mentions PL2 only in passing, focusing instead on two forms of smoothing for query likelihood, JM smoothing and Dirichlet prior smoothing. The chapter omits two very important topics in the area of retrieval: van Rijsbergen’s work on The geometry of information retrieval, and the particular challenges posed by comparing vectors in high-dimensional spaces, which characterize most keyword-based retrieval methods.
The text analysis chapters discuss word association mining, text clustering, categorization and summarization, topic analysis, opinion mining and sentiment analysis, and the joint analysis of text and structured data. Here too, the level of detail is uneven. The median chapter length in this section is 24 pages, but the chapter on topic analysis occupies 60 pages. The theme is an important one, but its treatment appears to be out of balance with the rest of the book.
The book includes exercises with each chapter, appendices giving further details on mathematical methods mentioned earlier in the book (Bayesian statistics, expectation maximization, and KL-divergence and Dirichlet prior smoothing), copious references, and an index. The references usefully include the page numbers on which they are cited, but there is some irregularity. For example, van Rijsbergen’s important volume is listed twice in the references, once alphabetized under R, and again under V.