This book strikes a balance between a step-by-step guide to many text mining techniques and a more general tutorial on cleaning, preprocessing, analysis, and visualization with R. Despite its emphasis on text mining, the book covers many fundamental concepts in programming and data manipulation. In fact, the material is probably sufficient for a very decent introductory programming course, and interested students should be able to add to the repertoire of techniques provided in the book with some ease. The book is also organized so that a reader working on a more complex task can easily navigate back to the explanation of its building blocks in earlier chapters. Thus, it also serves as a good reference for a more experienced researcher.
The book is divided into three parts, ordered by increasing sophistication (and thus complexity) of the analysis techniques covered: microanalysis (chapters 1 to 5), mesoanalysis (chapters 6 to 10), and macroanalysis (chapters 11 to 13). In each part, new data cleaning and preprocessing steps are introduced, together with new programming concepts. It is also worth noting that the author provides an extensive code library to get the reader started with useful analysis right in chapter 1. As a result, this book avoids the dreadful 200 pages of dry discussion about types and other details that are so common in many programming books available today.
In Part 1, the reader is introduced to the programming environment in R, and to fundamental statistical computations that can be performed with minimal preprocessing and cleaning of the input text. Chapter 1 is dedicated to downloading and setting up R, which is not the simplest of tasks. By the end of chapter 2, the student will have accomplished quite an impressive list of tasks, from loading and preprocessing data all the way down to plotting a chart with results. In order not to overwhelm the student, Jockers makes great use of footnotes and pointers to further reading, keeping the details out of the way yet always accessible. The subsequent chapters guide the student through different kinds of useful plots in R, statistical concepts such as frequencies and correlation, as well as programming concepts including assignments, scalar and indexed data types, input/output, regular expressions and string matching, and iteration with for loops.
Part 2 ups the tempo a bit, providing the student with more complex tasks that build on the smaller ones described in previous chapters. As a good teacher would, Jockers reinforces the key ideas from the previous chapters with new and larger examples, and constantly refers the reader back to earlier chapters for further details. This makes the book both accessible to more experienced programmers who wish to skip the basics and useful as a reference. This part describes several new statistical analysis and plotting commands, and teaches the student how to create a function or subroutine, which is another fundamental programming concept. The last chapter in this part provides code for reading Extensible Markup Language (XML) documents encoded according to the Text Encoding Initiative (TEI) format, which will certainly come in handy for avid digital humanists.
In Part 3, the author introduces fairly advanced tasks in text mining together with one key idea in programming: code reuse by importing external tools as libraries. The analysis described here is much more involved, and its results are thus much harder to interpret than those of the earlier parts. Moreover, this material might require more maturity from the students and is perhaps inappropriate for an introductory course. Nevertheless, the book does a very good job of helping the reader navigate the complexity of these tasks.
To conclude, this is a remarkably well-crafted book that will allow students to get a quick start and progress toward quite sophisticated text mining tasks. Naturally, the programming techniques together with the data cleaning and visualization approaches described in the book are transferable to other domains. Also, the set of practice exercises provided at the end of each chapter, with solutions at the end of the book, should serve well to help students solidify their knowledge and gain more confidence in their text mining skills. All in all, this book would be a great addition to the libraries of digital humanists and natural language enthusiasts who wish to expand their programming literacy and/or their text mining toolkits.