Zhai and Massung’s new book Text data management and analysis provides a fresh look at the areas of text retrieval, text mining, and text management. Traditionally, these three areas have been treated separately, each with a rich collection of research literature and textbooks. Zhai and Massung masterfully weave the contents of these areas together and present students and scholars with a unified view of “everything text,” including a piece of software, META, developed by the authors for a variety of text analysis and management tasks. Because of the large scope of the contents, the authors chose to concentrate on the breadth, not the depth, of the knowledge area in this 500-plus-page textbook. The primary audience is upper-level undergraduate or first-year graduate students.
The book contains 20 chapters divided into four parts, plus appendices. The first part reviews tools that are needed for the tasks, including probability and statistics, natural language understanding, and the installation and use of the META software. The second part covers the major topics of a traditional information retrieval course: text retrieval, vector space, and probabilistic models; feedback models; search engine implementation and evaluation; search over the web; and recommendation systems. The third part mainly deals with various text mining–related topics, such as word association mining, text clustering, topic analysis, and opinion mining. The fourth part is a summary of the authors’ views about a unified framework for text analysis and management. There are three appendices that describe some common statistics tools, the Bayesian model, the expectation-maximization algorithm, and KL-divergence and Dirichlet prior smoothing. Each chapter ends with a collection of exercises (about ten in each), which allow readers to assess how well they have learned the content. Exercises that use the authors’ software tool META are spread throughout the book.
The authors used this book in one of their (400-level) undergraduate courses and in two massive open online courses (MOOCs), all at the University of Illinois at Urbana-Champaign. Because text analysis and management are such important fields, it is a very good idea to seek ways to teach the topics at the undergraduate or early graduate level. The authors’ approach of unifying text information retrieval and text mining is very refreshing and worth noting. In particular, the authors provide a programming tool that students can use as they learn the course materials. But I think two challenges remain. The first is that the mathematical tools needed for text mining are typically out of reach for undergraduate computer science students. It is common practice in undergraduate data mining courses to use packages such as R or Weka to hide the details of statistical analysis. The second challenge is the amount of information covered in the book. It is a great idea to establish a unified framework as the book does. And to keep the book, and thus the courses using it, to a manageable size, I agree it is a very good idea to take a broad view of the topics without going into depth. But the number of topics covered in the book is vast. It will be a real challenge to use it in undergraduate courses. One may just have to cover selected topics in a typical semester. Regardless, this is a very good attempt to unify two important areas, text retrieval and text mining, for a society in which text analysis is becoming increasingly critical. The book also shows the depth and the breadth of the authors’ knowledge.