According to McKinsey, a management consulting firm, “big-data analytics could increase annual GDP in retailing and manufacturing by up to $325 billion and save as much as $285 billion in the cost of healthcare and government services”. However, the shortage of skilled data scientists could limit its adoption across different economic sectors and prevent those benefits from being reaped. Hence, a book on “becoming a data scientist” points to an attractive and well-paid opportunity for many statisticians, computer scientists, and business analysts.
Vincent Granville, a well-known data scientist himself, has collected some of his blog entries and turned them into a book. Unfortunately, “turning a vast amount of valuable but unstructured and scattered text,” which was previously published online, into a valuable book is not an easy endeavor. His resulting book lacks the organization of a good textbook and the logical progression you expect from a well-planned monograph. Occasional detours and some repetitions are unavoidable when recycling blog posts without rewriting them from scratch. Despite its online origin, the book still contains some valuable nuggets and might be of interest to practicing or aspiring data scientists once they are acquainted with their specialty.
The author starts by describing what data science is and, perhaps more importantly, what it is not, by including examples of “fake” data science, such as statistical material bundled as “big data” or “data science.” He describes 13 real-world scenarios where a true data scientist might add value to their employers and customers. As Jerry Weinberg used to say when talking about the role of consultants, it is always a people problem. Data scientists must bridge the gap between statisticians, computer scientists, analysts, and other business stakeholders, which requires an uncommon combination of technical and business skills.
Along with some features that make “big data” different from conventional statistics (such as the triple V: volume, variety, and velocity), you will also find somewhat odd sections, including the transcript of a Q&A session, Excel tips, information on how to produce videos with R, or a description of how to use the digits of pi in a random number generator. Granville proposes a peculiar taxonomy that classifies data scientists into fake, self-made, amateur, and extreme data scientists, describing each category with its prototypical features and peculiarities. He also offers some tips for independent consultants and entrepreneurs, based on his own experience, ranging from sample project proposals to startup ideas for data scientists.
Apart from rambles on data science, what makes a true data scientist, and many other related topics, the book also contains some interesting technical material. Deviating from conventional statistics, Granville advocates the use of model-free confidence intervals (supported by his own First Analyticbridge Theorem), a bumpiness measure for time series, a synthetic variance that avoids the numeric instability of its traditional definition (an important issue when computing it for huge datasets), and a novel rank correlation coefficient (which set the stage for the first theoretical data science competition at the author’s Data Science Central website). You will also find a useful linear-time clustering algorithm for the creation of taxonomies, and heuristic guidelines on how to determine the optimal number of clusters by finding the elbow in the curve that displays the percentage of variation explained, rather than the more common sum of squared errors (SSE).
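To make the elbow heuristic concrete, here is a minimal sketch in Python. It is not Granville's procedure, whose details are not reproduced in this review. It assumes the percentage of variation explained has already been computed for k = 1, 2, 3, ... clusters, and it locates the elbow as the point where the marginal gain drops the most (the largest discrete second difference), which is only one of several common conventions. The function name `elbow_index` and the input values are hypothetical, chosen purely for illustration.

```python
def elbow_index(explained):
    """Return the cluster count k at the elbow of an explained-variation curve.

    explained[i] is the percentage of variation explained with i + 1 clusters.
    Assumption: the elbow is where the gain from adding one more cluster
    falls off most sharply (largest discrete second difference).
    """
    if len(explained) < 3:
        raise ValueError("need at least three points to locate an elbow")

    best_k, best_drop = None, float("-inf")
    # Interior points only: each needs a neighbor on both sides.
    for i in range(1, len(explained) - 1):
        gain_before = explained[i] - explained[i - 1]  # gain from reaching k = i + 1
        gain_after = explained[i + 1] - explained[i]   # gain from going past it
        drop = gain_before - gain_after
        if drop > best_drop:
            best_k, best_drop = i + 1, drop
    return best_k


# Made-up percentages for k = 1..6, for illustration only.
curve = [40.0, 70.0, 88.0, 91.0, 93.0, 94.0]
print(elbow_index(curve))
```

With these illustrative values, the gains per added cluster are 30, 18, 3, 2, and 1 points, so the sharpest falloff occurs after the third cluster.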
For a 300-page book, readers will be surprised by the large number of topics that the author addresses, from opinionated rants on current buzzwords to insightful comments on particular application domains, with a strong focus on the author’s pet subjects (namely, click fraud detection and online marketing). Popular applications, such as stock trading, and rare ones, such as meteorite hits, all receive their own share of attention in Granville’s discussions.
The book ends with a couple of chapters with useful information for aspiring data scientists, from a (biased) set of job interview questions to desirable skills, hiring issues, job titles, and a salary survey. The author also provides links to professional resources, a list of companies hiring data scientists, job ads, and sample resumes.
If you are looking for a well-organized book that tries to create some order within the buzzword-driven data science field, this book is probably not the right place to start your search. However, you can still find some valuable information in this book, which is built from “unstructured and scattered text” (page xi).