Computing Reviews
Foundations of predictive analytics
Wu J., Coggeshall S., Chapman & Hall/CRC, Boca Raton, FL, 2012. 337 pp. Type: Book (978-1-439869-46-8)
Date Reviewed: Aug 21 2012

Wu and Coggeshall’s monograph surveys many of the techniques currently employed to analyze data. The authors also sell a commercial Microsoft Excel add-in for building predictive models, DataMinerXL (http://www.dataminerxl.com/), which implements many of these techniques.

According to my Merriam-Webster dictionary, the term “analytics” has been used since the 16th century to refer to the method of logical analysis. Nowadays, however, it usually serves as a fine-sounding replacement for data mining, the discovery of meaningful patterns in data.

At the crossroads of machine learning and statistics, analytics (or data mining, if you like) has become a key area in computer science. I wouldn’t go as far as Hal Varian, Google’s Chief Economist and a Berkeley professor, who claimed that “the sexy job in the next ten years will be statisticians” [1]. Nonetheless, as he explained,

The ability to take that data--to be able to understand it, to process it, to extract value from it, to visualize it, to communicate it--that’s going to be a hugely important skill in the next decades ... because now we really do have essentially free and ubiquitous data. So the complementary scarce factor is the ability to understand that data and extract value from it. [1]

The authors’ writing style is quite formal, and their book is aimed at mathematically oriented readers. They justify their approach: “In order to apply different techniques to different problems appropriately, it is essential to understand the assumptions and the theory behind each technique.” This sound line of reasoning leads them to include step-by-step derivations from the underlying formal assumptions to the final theoretical results. They also claim to include discussions on practical subjects missing from other texts, but I am afraid that they overstate this facet of their book, as well as how self-contained it is.

As a matter of fact, this book reads more like a set of notes for personal use than a textbook for learning new techniques. As such, it is more useful as a reference, since it tries to capture myriad hard-to-find connections and factoids of interest to data analysts. From the formal properties of statistical distributions, to the application of the Dempster-Shafer theory (DST) of evidence, to classifier ensembles, the book offers a suggestive overview of countless ideas, techniques, and results, but often without much guidance for the reader.

This book is crammed with information on the modeling process, statistical distributions, matrix algebra, linear regression, linear discriminant analysis, principal component analysis, Bayesian classifiers, neural networks, support vector machines, nonparametric classifiers (nearest neighbors), clustering, fuzzy logic, time series analysis, survival data analysis, data preprocessing, variable selection, mutual information, multicollinearity detection (that is, methods for identifying and removing highly correlated variables), model goodness measures for discrete and continuous dependent variables, a long roster of optimization methods, and a miscellany of topics such as multidimensional scaling, simulation, odds normalization, reject inference, and the aforementioned DST. Given that all of these topics are covered in about 300 pages, which also include more than 1,500 equations and dozens of references to specific DataMinerXL functions, little room is left for anything beyond the succinct explanations you will find in this book (where they exist).

This extreme conciseness comes at a cost, since the required background information might not be readily available to many readers, and the general lack of flow in the text might dissuade them from delving deeper into the subject. The order of presentation is also somewhat cumbersome, with an excess of forward references. For instance, the chapter on optimization methods appears near the end of the book, rather than accompanying the propaedeutic chapters on statistics and algebra, while the chapters on linear and nonlinear models precede the chapters on data preparation and model goodness measures, when the reverse order might make more sense.

This book is certainly not for neophytes, and maybe not for many self-taught practitioners. However, if interpreted as an annotated collection of notes and interesting factoids, it offers some valuable nuggets here and there.

Reviewer: Fernando Berzal
Review #: CR140519 (1301-0015)

[1] Varian, H. Hal Varian on how the Web challenges managers. Interview by James Manyika. McKinsey Quarterly (Jan. 2009), http://www.mckinseyquarterly.com/Hal_Varian_on_how_the_Web_challenges_managers_2286.
Data Mining (H.2.8 ...)
Knowledge Representation Formalisms And Methods (I.2.4)
Reference (A.2)
Reproduction in whole or in part without permission is prohibited.   Copyright 1999-2024 ThinkLoud®