As Hal Varian, Google’s renowned chief economist, says,

The sexy job in the next ten years will be statisticians. People think I’m joking, but who would’ve guessed that computer engineers would’ve been the sexy job of the 1990s? The ability to take data--to be able to understand it, to process it, to extract value from it, to visualize it, to communicate it--that’s going to be a hugely important skill in the next decades. [1]

This is precisely the domain of data mining or, using the current catchphrase, data science.

Unlike other textbooks on data mining techniques, this book deliberately avoids an algorithm-centered approach. It is not a replacement for a more thorough textbook, yet it provides value to data mining practitioners and so-called data scientists. You will not find detailed information on particular big data technologies, such as Hadoop/MapReduce and NoSQL databases, either. “Chemistry is not about test tubes,” and data science is not about the particular tools needed to do the job in practice. A book on such tools becomes obsolete as soon as it hits the presses; here, the authors have tried to focus on the principles that should guide data scientists.

The first chapters describe the kinds of problems data science tries to solve, with interesting case studies that illustrate the importance of data as a strategic asset for a company. The data mining process is described using the cross-industry standard process for data mining (CRISP-DM) as a framework, and the multidisciplinary nature of data mining is emphasized. Up to this point, apart from the interest of the particular examples, there is nothing unusual for a data mining textbook.

The distinctive approach followed by the authors to appeal to a wider audience, including business people involved in decision making, starts in the discussion of particular data mining techniques. Supervised segmentation used to select the most informative attributes serves as the perfect prelude to decision tree induction algorithms. Mathematics is kept to a minimum, and the somewhat verbose descriptions required to avoid algorithm pseudocode gently introduce key data mining techniques. Linear classifiers, support vector machines, and logistic regression are also described from an intuitive point of view, as nearest neighbor classifiers, *k*-means, and hierarchical clustering will be in a later chapter.
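The idea behind supervised segmentation can be sketched in a few lines. The following is a minimal illustration, assuming information gain (entropy reduction) as the measure of how informative an attribute is; the toy customer records, attribute names, and function names are hypothetical and are not taken from the book.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy, in bits, of a sequence of class labels."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())


def information_gain(rows, attribute, target):
    """Entropy reduction obtained by splitting the rows on one attribute."""
    parent = entropy([row[target] for row in rows])
    groups = {}
    for row in rows:
        groups.setdefault(row[attribute], []).append(row[target])
    # Weighted average entropy of the segments produced by the split.
    children = sum(len(g) / len(rows) * entropy(g) for g in groups.values())
    return parent - children


# Hypothetical toy data: did the customer churn?
customers = [
    {"contract": "monthly", "region": "north", "churn": "yes"},
    {"contract": "monthly", "region": "south", "churn": "yes"},
    {"contract": "annual", "region": "north", "churn": "no"},
    {"contract": "annual", "region": "south", "churn": "no"},
]

gains = {a: information_gain(customers, a, "churn") for a in ("contract", "region")}
best = max(gains, key=gains.get)
```

In this toy data set, `contract` separates churners from non-churners perfectly while `region` tells us nothing, so a tree induction algorithm that splits on the attribute with the highest gain would choose `contract` first.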

In a business context, the evaluation of results is key. Hence, several chapters are devoted to the topic. The first of them focuses on overfitting (that is, incorporating into the models features specific to the training set that do not generalize well beyond that training set). Keeping separate test sets (also known as holdout data), fitting graphs, and learning curves are clearly motivated and analyzed. Mechanisms for avoiding overfitting, such as decision tree pruning and regularization, are also introduced. In a different chapter, the authors present expected value as the evaluation framework of choice for business problems, where the monetary value of each outcome can be estimated and used to evaluate the results of predictive models. Another chapter on the visualization of model performance serves as a nice survey of profit, receiver operating characteristic (ROC), cumulative response, and lift curves.
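The expected value framework mentioned above lends itself to a short worked example. This is only a sketch under stated assumptions: the outcome rates and monetary values below are invented for illustration, and the function name is hypothetical rather than anything defined in the book.

```python
def expected_value(probabilities, values):
    """Sum of p(outcome) * value(outcome) over all possible outcomes."""
    if abs(sum(probabilities.values()) - 1.0) > 1e-9:
        raise ValueError("outcome probabilities must sum to 1")
    return sum(p * values[outcome] for outcome, p in probabilities.items())


# Hypothetical outcome rates of a targeting model, measured on holdout data.
rates = {
    "true_positive": 0.10,   # targeted, and the customer responded
    "false_positive": 0.20,  # targeted, but no response
    "false_negative": 0.05,  # not targeted, would have responded
    "true_negative": 0.65,   # not targeted, would not have responded
}

# Hypothetical monetary value of each outcome (profit per customer).
values = {
    "true_positive": 99.0,   # sale margin minus the cost of the offer
    "false_positive": -1.0,  # cost of a wasted offer
    "false_negative": 0.0,
    "true_negative": 0.0,
}

profit_per_customer = expected_value(rates, values)
```

With these made-up figures, the model is worth about 9.70 per customer. Two models can then be compared on expected profit rather than on accuracy alone.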

A couple of chapters in the last third of the book describe two important areas that are not covered in the previous chapters. The first is probabilistic reasoning using Bayes’ theorem (as used by the naive Bayes classifier), where readers will discover that the probability of liking Sheldon Cooper on Facebook is about 30 percent higher for high-IQ people. The second is text mining, where readers will find the connection between the inverse document frequency (IDF) used in information retrieval and the entropy used for building decision tree classifiers. In another of the final chapters, readers will also get a glimpse of other data mining tasks and techniques, such as association rules, link prediction, data reduction, latent information models, and ensemble models, which are justified by appealing to the bias-variance decomposition of error.
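A figure such as "30 percent higher" corresponds to a lift of about 1.3. The sketch below assumes lift is defined as the probability of the evidence within the class divided by its probability in the whole population; the counts are invented so that the result matches the figure quoted above, and they are not the data used in the book.

```python
def lift(evidence_in_class, class_size, evidence_total, population):
    """Ratio p(evidence | class) / p(evidence)."""
    p_evidence_given_class = evidence_in_class / class_size
    p_evidence = evidence_total / population
    return p_evidence_given_class / p_evidence


# Hypothetical counts: 1,000 users, 200 of them in the high-IQ group.
# 100 users overall like the page, 26 of whom are in the high-IQ group.
sheldon_lift = lift(evidence_in_class=26, class_size=200,
                    evidence_total=100, population=1000)
```

A lift above 1 means the evidence is more common in the class than in the population at large; a lift of 1 means the evidence carries no information about class membership.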

Obviously, readers cannot expect a thorough treatment of all the aforementioned topics in a 350-page book, or even in more technical textbooks with three times that page count. The approach followed by the authors also lengthens some discussions in order to avoid being too algorithmically oriented. This will benefit business people who want to become acquainted with the techniques behind data mining without being drowned in technical details. Computer scientists, as aspiring data scientists, can still benefit from not being exclusively focused on algorithms, a common malady in many data mining monographs, and from getting the big picture of the business problems that data mining technologies try to solve. The expected value framework and the different visualization techniques at their disposal are key ingredients for communicating with those responsible for making decisions in a business context.
