Computing Reviews
Doing data science : straight talk from the frontline
Schutt R., O’Neil C., O’Reilly Media, Inc., Sebastopol, CA, 2013. 406 pp. Type: Book (978-1-449358-65-5)
Date Reviewed: May 16 2014

Business hype around data science, and more recently, the data science profession, has done surprisingly little to close the gap in understanding how data science is done and the persona that drives its application. For this reason, I applaud the primary motivation of O’Neil and Schutt’s book, which is to describe the doing of data science, and the very human starting point.

The authors proclaim outright that their work is not just another machine learning book (indeed, there are plenty of those). Instead, their objective is to describe the process of doing data science in layman’s terms. It is a good read for any student, or any business person who wants to understand the mechanics and variables behind the hype. The persona of the data scientist is central to how data science is done, and fittingly, the first chapter of the book is devoted to the subject.

O’Neil and Schutt are clear in their character development. Data scientists are individuals with very strong mathematical and analytical skills who are highly comfortable working with computers, and who are also highly savvy. They intuitively grasp computing power as an economic force rather than a purely technical one. They also have at least basic programming skills and can wear different hats across hardware and software. To summarize, they combine a holistic ability to think about and manage data with a strong business perspective.

O’Neil and Schutt also capture nuances that are so important to understanding how data science is done. Increasingly, data science projects are conducted by multidisciplinary teams. Collaboration is critical, and how to build an efficient data science team is in and of itself a compelling subject, which deserves to be part of a data science curriculum.

The book’s title led me to expect industrial-strength, yet down-to-earth, real-world examples of data science collaboration in practice. The reader will enjoy the examples the book does offer, but because the content is based on a Columbia University course, with contributions from many other practitioners who are essentially guest authors, its primary focus is the illustration of principal concepts (including code in R, bash, and Python).

This book successfully introduces foundational algorithms and the basics of modeling. It is nice to see an attempt to explain the mathematical foundations behind naive Bayes or logistic regression; however, Schutt and O’Neil don’t attempt to maintain an academic level of rigor. Instead, they offer a relaxed and refreshing overview, which includes a useful discussion of statistical inference within the big data context (chapter 2); John Tukey’s exploratory data analysis, which provides a nice alternative to the formal method by appealing to our cognitive power of image recognition (data visualization and social networks are discussed in chapters 9 and 10); and the long-standing limitations of the correlation approach and the related need for causal study (chapter 11). Unfortunately, the few explanations related to ROC curves are very confusing. This is where an absence of rigor is perhaps inadvisable, though it is an exception. Chapters 6, 8, and 10 are the most original and give practical pointers on data science in financial modeling, getting insights from data in general, and social networks.

The remaining chapters of the book are devoted to the engineering aspects of the process and operation of data science. They also give some coverage to futuristic ideas such as full automation and robotic statisticians (chapters 15 and 16), a lovely subject to discuss with students and one that raises provocative questions, however enormous the consequences.

I find the presentation style of this information easy to understand and therefore adequate for the goal, that is, an appropriate starting point for a student who desires to pursue a career in data science for industry. For those who are mostly interested in devising new algorithms and heuristics, it may seem lightweight; however, such depth should not be expected here. As far as I know, a practical and in-depth course on data science has yet to be developed. There are some good attempts out there, including introductions to machine learning, for example, Andrew Ng’s popular course on the Coursera MOOC platform. Data science, however, is much more than machine learning.

In conclusion, I circle back to an analogy of the best chess strategy: human-computer teams are superior to humans or machines alone, as long as the data scientist (her or him!) brings the required persona to the table. The reader of the book will receive enough information to continue and be optimistic about it.


Reviewer:  Serge Berger Review #: CR142290 (1408-0621)
Reviewer-selected categories: General (H.0); Learning (I.2.6)
 

Reproduction in whole or in part without permission is prohibited.   Copyright 1999-2024 ThinkLoud®