Computing Reviews

Foundations of predictive analytics
Wu J., Coggeshall S., Chapman & Hall/CRC, Boca Raton, FL, 2012. 337 pp. Type: Book
Date Reviewed: 08/21/12

Wu and Coggeshall’s monograph surveys many of the techniques currently employed to analyze data. The authors also sell a commercial Microsoft Excel add-in for building predictive models, DataMinerXL (http://www.dataminerxl.com/), which implements many of these techniques.

According to my Merriam-Webster dictionary, the term “analytics” has been used since the 16th century to refer to the method of logical analysis. Nowadays, however, it usually serves as a fine-sounding replacement for data mining, the discovery of meaningful patterns in data.

At the crossroads of machine learning and statistics, analytics (or data mining, if you like) has become a key area in computer science. I wouldn’t go as far as Hal Varian, Google’s Chief Economist and a Berkeley professor, who claimed that “the sexy job in the next ten years will be statisticians” [1]. Nonetheless, as he explained,

The ability to take that data--to be able to understand it, to process it, to extract value from it, to visualize it, to communicate it--that’s going to be a hugely important skill in the next decades ... because now we really do have essentially free and ubiquitous data. So the complementary scarce factor is the ability to understand that data and extract value from it. [1]

The authors’ writing style is quite formal, and their book is aimed at mathematically oriented readers. They justify their approach: “In order to apply different techniques to different problems appropriately, it is essential to understand the assumptions and the theory behind each technique.” This sound line of reasoning leads them to include step-by-step derivations from the underlying formal assumptions to the final theoretical results. They also claim to include discussions on practical subjects missing from other texts, but I am afraid that they overstate this facet of their book, as well as its self-containedness.

As a matter of fact, this book reads more like a set of notes for personal use than a textbook for learning new techniques. As such, it is more useful as a reference, since it tries to capture myriad hard-to-find connections and factoids of interest to data analysts. From the formal properties of statistical distributions, to the application of the Dempster-Shafer theory (DST) of evidence, to classifier ensembles, the book offers a suggestive overview of countless ideas, techniques, and results, but often without much guidance for the reader.

This book is crammed with information on the modeling process, statistical distributions, matrix algebra, linear regression, linear discriminant analysis, principal component analysis, Bayesian classifiers, neural networks, support vector machines, nonparametric classifiers (nearest neighbors), clustering, fuzzy logic, time series analysis, survival data analysis, data preprocessing, variable selection, mutual information, multicollinearity detection (that is, methods for identifying and removing highly correlated variables), model goodness measures for discrete and continuous dependent variables, a long roster of optimization methods, and a miscellany of topics such as multidimensional scaling, simulation, odds normalization, reject inference, and the aforementioned DST. Given that all of these topics are covered in about 300 pages, which also include more than 1,500 equations and dozens of references to specific DataMinerXL functions, little room is left for explanation: where explanations exist at all, they are succinct.

This extreme conciseness comes at a cost, since the required background information might not be readily available to many readers, and the general lack of flow in the text might dissuade them from delving deeper into the subject. The order of presentation is also somewhat cumbersome, with an excess of forward references. For instance, the chapter on optimization methods appears near the end of the book, rather than accompanying the propaedeutic chapters on statistics and algebra, while the chapters on linear and nonlinear models precede the chapters on data preparation and model goodness measures, when the reverse order might make more sense.

This book is certainly not for neophytes, and maybe not for many self-taught practitioners. However, if interpreted as an annotated collection of notes and interesting factoids, it offers some valuable nuggets here and there.


[1] Varian, H. Hal Varian on how the Web challenges managers. Interview by James Manyika. McKinsey Quarterly (Jan. 2009), http://www.mckinseyquarterly.com/Hal_Varian_on_how_the_Web_challenges_managers_2286.

Reviewer: Fernando Berzal
Review #: CR140519 (1301-0015)

Reproduction in whole or in part without permission is prohibited.   Copyright 2024 ComputingReviews.com™