An old rule of thumb suggests that 90 percent of all potentially relevant business information is in unstructured form. Hence, it is no surprise that many mathematically ill-defined problems associated with text analysis have attracted a lot of attention from data mining researchers. Text data management is a more mature field, and its associated text data access problems are tackled with the help of information retrieval techniques, as the popularity of web search engines attests. Zhai and Massung have managed to write a very readable introduction to both fields and their state of the art in 500 pages.
After the usual introductory chapters, which include some background information and a very cursory mention of natural language processing (NLP) techniques, they delve into text data access methods, also known as information retrieval. Here, they discuss basic techniques such as ranking documents in response to a user query. They gently introduce retrieval models and the rationale behind them until they logically reach state-of-the-art vector space models, namely pivoted-length normalization and the Okapi BM25 ranking function. They also cover probabilistic models and, by clever use of analogies with the heuristic models, clearly explain the query likelihood retrieval model and the smoothing methods often used with it. Their discussion is not only theoretical, since they also cover practical issues associated with the implementation of information retrieval systems and, as you may expect, web search engines as the most prominent example of information retrieval systems nowadays. Their analysis of web search includes crawling, indexing, and link analysis, with the usual description of Google’s PageRank and Kleinberg’s HITS. The information retrieval half of this book is completed with short chapters on feedback (that is, how to take into account a user’s actions to improve information retrieval results) and recommender systems, which provide relevant information to the user in “push” mode (in contrast to the “pull” mode of search and browsing, when the user initiates the requests).
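For readers unfamiliar with the Okapi BM25 ranking function mentioned above, the following is a minimal sketch of how such a scorer can be written. It is not taken from the book: the function name `bm25_scores`, the defaults `k1=1.2` and `b=0.75`, and the toy documents are assumptions chosen for illustration, and the IDF term used here, log((N + 1) / df), is only one of several variants in use.

```python
import math
from collections import Counter


def bm25_scores(query, docs, k1=1.2, b=0.75):
    """Return one BM25 score per document (hypothetical helper).

    query: list of query terms; docs: list of tokenized documents.
    k1 and b are commonly used defaults, assumed here for illustration.
    """
    n_docs = len(docs)
    avg_len = sum(len(d) for d in docs) / n_docs
    # Document frequency: in how many documents each term occurs.
    df = Counter(term for d in docs for term in set(d))
    scores = []
    for d in docs:
        tf = Counter(d)
        # Length normalization: longer-than-average documents are penalized.
        norm = 1 - b + b * len(d) / avg_len
        score = 0.0
        for term in query:
            if tf[term] == 0:
                continue  # terms absent from the document contribute nothing
            idf = math.log((n_docs + 1) / df[term])  # one IDF variant (assumed)
            # Term-frequency saturation: gains shrink as tf grows.
            score += idf * tf[term] * (k1 + 1) / (tf[term] + k1 * norm)
        scores.append(score)
    return scores


# Toy collection, invented for this example.
docs = [
    ["text", "mining", "text"],
    ["web", "search", "engine"],
    ["text", "retrieval", "search"],
]
scores = bm25_scores(["text", "mining"], docs)
ranking = sorted(range(len(docs)), key=lambda i: scores[i], reverse=True)
print(ranking)
```

The two ingredients the sketch makes visible are the saturating term-frequency transformation controlled by `k1` and the document length normalization controlled by `b`, which are the heuristics behind this family of ranking functions.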
The second half of Zhai and Massung’s textbook focuses on text mining, or “text analysis,” to use the authors’ preferred term. Word association mining, text clustering, text categorization, text summarization, topic modeling, opinion mining, and sentiment analysis are the main text mining problems studied in this second half of the book. Many of the discussed techniques are unavoidably application-specific, hence the authors’ emphasis on the importance of feature engineering for solving problems such as text categorization, sentiment analysis, or text-based prediction. Their coverage of different problems is not without stark contrasts. For instance, a 60-page guided tour of probabilistic topic modeling, where probabilistic latent semantic analysis (PLSA) and latent Dirichlet allocation (LDA) are excruciatingly dissected, is followed by a shallow overview chapter on opinion mining and sentiment analysis. In this short chapter, text data is regarded as data generated by humans acting as subjective sensors, which enables mining knowledge about the human observer who generated the text data. The subjective content of text data is then analyzed using techniques such as ordinal logistic regression or latent aspect rating analysis (LARA), proposed by the first author in two KDD papers [1,2].
The text mining half of the book ends with a 30-page survey chapter on the joint analysis of text and structured data, which is a requirement in many real-world applications. In fact, non-text data can enrich text analysis, whereas text data can help interpret non-text data (for example, pattern annotation). Three example techniques illustrate how topic analysis can be combined with non-text data in different domains: the use of different views in contextual PLSA, the network-supervised topic model in NetPLSA (for the joint analysis of text and social network data), and iterative causal topic modeling for the analysis of text associated with time series.
The book’s final chapter is a short position paper where the authors advocate for integrated software frameworks that support both text management (that is, information retrieval) and text analysis (that is, text mining). It can be read as a broad-brush sketch of the essentials for future unified systems.
In general terms, the authors typically provide verbose descriptions of the reasons behind the design of specific techniques, with numerical examples and illustrative figures from the slides of two massive open online courses (MOOCs) offered by the first author on Coursera. They also provide specific sections that describe in detail the proper way to evaluate every different kind of technique, a key factor to be taken into account when applying the discussed techniques in practice.
The book, however, is not always self-contained, since its broad scope in a limited number of pages entails an unavoidable depth/breadth tradeoff. Most basic techniques can be implemented just by following the instructions and guidelines in the text, although interested readers might need to resort to the bibliographic references if they want to gain a thorough understanding of the many advanced techniques. Fortunately, the authors include some bibliographic notes and very selective suggestions for further reading at the end of each chapter, instead of the encyclopedic collection of references common in many other textbooks.
Although readers will not find detailed coverage of NLP techniques and some chapters might seem lacking in depth, advanced undergraduate students might find this book to be a valuable reference for getting acquainted with both information retrieval and text mining in a single volume, a worthwhile achievement for a 500-page textbook.