Today’s society is inundated with data. An extraordinary collection of sensors orbits the earth, collecting information on climate change. Digital medical devices continue to increase in number and applicability, improving diagnostics and care. Even basic activities such as shopping or renting a movie are customized by retailers based on customers’ previous behavior. For these activities to be accurate, data needs to be modeled and processed efficiently. This book aims to provide guidance on how to handle data in our information-driven world. Written by a senior National Institutes of Health (NIH) scholar and published by a prestigious publisher, the text has the advantage of being well written and homogeneous. The following provides a brief analysis of its contents.
Klemens’ intention is, indeed, to take you from the beginning. Data modeling today can and should be done only with computers. The author suggests learning a programming language (C) and familiarizing yourself with a database environment (SQL). For this, he provides an extensive primer that occupies roughly half of the book (chapters 2 to 6 and two appendices); the second half is dedicated to statistical methods and the interpretation of results (chapters 7 to 11). This approach is sure to raise some eyebrows. First, while C has been used extensively for numerical methods, it is by no means a straightforward tool, nor is it the first one a beginning researcher is likely to reach for, as the author suggests; dedicated commercial packages exist, albeit for a price. Second, the author’s efforts to teach you the language are overshadowed by an extraordinary array of standalone textbooks focused on programming. Such books, while possibly too extensive for a quick learning strategy, cover material that is missing here, such as software development principles, integrated development environments (IDEs), proper code documentation, and coding style. To Klemens’ merit, following the initial chapter on basic programming, the text includes several further programming chapters focused on the GNU Scientific Library (GSL) and the Apophenia numerical library, topics one would not find in a regular programming book. Similarly, the quick introduction to SQL is welcome, although the author should also have considered a discussion of other database environments currently available.
In the second part of the book, the programming basics are nicely employed to explain statistical methods. From an ever-increasing number of possible topics, the author chose several fundamental ones: distributions (chapter 7), linear transforms, including principal component analysis (PCA) (chapter 8), hypothesis testing (chapter 9), maximum likelihood estimation (chapter 10), and Monte Carlo simulations (chapter 11). The author draws on his vast experience to present the material in a friendly manner, starting from basic theoretical concepts and ending with code. A particularly helpful feature is the use of side boxes, such as vocabulary and metadata boxes, that provide additional insight for the reader.
In the preface, Klemens suggests that the book be used by graduate students and beginning researchers. This may very well be appropriate, given the well-written text and the easy-to-follow sequence. Each chapter includes a number of quick questions intended to reinforce the learning; such questions could, in fact, serve as a starting point for designing assessment tools if one were to adopt this as a textbook for a class. The reading will not be complete without a visit to the author’s Web page [1]. The online repository not only provides access to the code examples and additional programming support but, more importantly, lets you read the author’s thoughts on various issues through his well-maintained blog.