Computing Reviews
Python data analytics
Nelli F., Apress, New York, NY, 2015. 364 pp. Type: Book (978-1-484209-59-2)
Date Reviewed: Jul 6 2016

The author starts by considering the choice of problem, team (preferably, as he says, interdisciplinary), and data sets, before going into data analysis in the narrow sense. I do hope readers don’t skip over this introduction in a hurry to get to “the meat” because a poor choice of data will be hard or impossible to fix later on.

Chapter 2 describes the Python stable and gives a whistle-stop, somewhat idiosyncratic tour of various development environments.

Chapter 3 then describes NumPy, which can be thought of as “Python meets numerical arrays.” The author has forgotten to explain what floating-point numbers are, or the difference between an int (implicit in chapter 2) and int32, which makes its appearance silently in chapter 3. Page 55 talks about matrix product, but actually means element-by-element multiplication and not matrix product as mathematicians mean it.
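
To make both points concrete, here is a minimal sketch (mine, not taken from the book; the arrays are made up for illustration) contrasting element-by-element multiplication with the matrix product, and Python's unbounded int with NumPy's fixed-width int32.

```python
import numpy as np

A = np.array([[1, 2], [3, 4]], dtype=np.int32)
B = np.array([[0, 1], [1, 0]], dtype=np.int32)

# "*" multiplies matching entries: this is NOT the matrix product.
elementwise = A * B

# "@" (equivalently np.dot(A, B)) is the matrix product in the
# mathematician's sense: rows of A combined with columns of B.
matrix_product = A @ B

print(elementwise)     # [[0 2], [3 0]]
print(matrix_product)  # [[2 1], [4 3]]

# A plain Python int has arbitrary precision; int32 is a fixed 32-bit type.
big_python_int = 2 ** 40
int32_max = np.iinfo(np.int32).max
print(big_python_int > int32_max)  # True
```

Here B swaps the columns of A under the matrix product, whereas the element-by-element result simply zeroes the diagonal.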

Chapter 4 introduces Pandas, the Python data analysis library, and its key data structures, the Series and the DataFrame. While this chapter is almost all about data storage and access, there is a one-page digression into correlation and covariance, but with no explanation of what these mean.
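
For readers who want the missing explanation, the following sketch (an illustration with invented data, using NumPy rather than the book's Pandas calls) computes both quantities from their definitions. Sample covariance is the average product of deviations from the two means; the Pearson correlation rescales it by the two standard deviations so that it lies between -1 and 1. Pandas offers the same quantities as `Series.cov` and `Series.corr`.

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 4.0, 5.0, 4.0, 5.0])

# Deviations of each observation from its own mean.
dx = x - x.mean()
dy = y - y.mean()

# Sample covariance: summed products of deviations, divided by n - 1.
covariance = np.sum(dx * dy) / (len(x) - 1)

# Pearson correlation: covariance rescaled by the sample standard deviations.
correlation = covariance / (x.std(ddof=1) * y.std(ddof=1))

print(covariance)   # 1.5
print(correlation)  # about 0.77

# The library routines agree with the hand computation.
assert np.isclose(covariance, np.cov(x, y)[0, 1])
assert np.isclose(correlation, np.corrcoef(x, y)[0, 1])
```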

The treatment of not a number (NaN) is confusing. While it is certainly true that NaN values are a problem, this is a problem intrinsic to data analysis, and many (including this reviewer) feel that NaN is a much neater solution than ad hoc encodings.
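
A small sketch of why NaN is the neater solution (my own example; the -999.0 sentinel is a hypothetical ad hoc encoding, not one used in the book):

```python
import numpy as np
import pandas as pd

# The same three readings, with the middle one missing.
readings_sentinel = pd.Series([12.0, -999.0, 15.0])  # ad hoc "missing" code
readings_nan = pd.Series([12.0, np.nan, 15.0])       # explicit NaN

# The sentinel is silently treated as real data and wrecks the mean.
mean_sentinel = readings_sentinel.mean()

# Pandas skips NaN by default, and the gap remains visible and countable.
mean_nan = readings_nan.mean()
n_missing = int(readings_nan.isna().sum())

print(mean_sentinel)  # -324.0
print(mean_nan)       # 13.5
print(n_missing)      # 1
```

With the sentinel, nothing in the data type warns the analyst that one value is not a measurement; with NaN, the missing entry can be counted, dropped, or filled deliberately.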

Chapter 5 describes the input/output of Pandas objects, from such useful sources as CSV files, HTML/XML files, Excel spreadsheets, JSON data, and SQL and NoSQL (MongoDB) sources. With JSON, the author describes the JSONviewer tool, which would certainly seem to be a useful addition to one’s toolbox. This chapter also describes interfaces to the HDF5 data format, and then, somewhat confusingly, introduces Python’s pickle (and cPickle) serialization tools before explaining that you don’t need them in Pandas because it has its own serialization.

Chapter 6 describes data manipulation. This is mostly unexceptionable, but there is one very misleading example. The author considers an outlier to be a value whose absolute value is greater than three times the standard deviation. He does not include the mean in his definition at all, either in the text or, worse, in the code.
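
The difference matters as soon as the data are not centred on zero. The sketch below is my own illustration with invented values, not the book's code; it compares the mean-free rule with the usual rule based on distance from the mean.

```python
import numpy as np

# Twenty readings clustered around 100, plus one genuine outlier (130).
values = np.array([99.0, 100.0, 101.0] * 6 + [100.0, 130.0])

mean = values.mean()
std = values.std()

# Mean-free rule as described in the review: |x| > 3 * std.
mean_free_flags = np.abs(values) > 3 * std

# Usual rule: distance from the mean greater than 3 * std.
centred_flags = np.abs(values - mean) > 3 * std

print(int(mean_free_flags.sum()))  # 20: every reading is flagged
print(int(centred_flags.sum()))    # 1: only the reading of 130
```

The mean-free rule gives sensible answers only when the mean happens to be close to zero, which may be why the error is easy to miss in a synthetic example.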

Chapter 7 describes the matplotlib library, and states that much of the functionality is inspired by MATLAB. However, the author takes familiarity with MATLAB too far and forgets to explain (in a black-and-white book) that “ro” generates red circles or to give any list of the options.
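
For reference, a short sketch of how such format strings decode (my own example; the data are arbitrary, and the Agg backend is selected only so that the code runs without a display). The string combines a colour letter with a marker and/or line-style code.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, no window needed
import matplotlib.pyplot as plt

xs = [1, 2, 3]
fig, ax = plt.subplots()

# "ro": r = red, o = circle markers, no connecting line.
(red_circles,) = ax.plot(xs, [1, 4, 9], "ro")

# "bs": b = blue, s = square markers.
(blue_squares,) = ax.plot(xs, [2, 5, 10], "bs")

# "k--": k = black, -- = dashed line, no markers.
(black_dashed,) = ax.plot(xs, [3, 6, 11], "k--")

print(red_circles.get_marker(), blue_squares.get_marker())  # o s
print(black_dashed.get_linestyle())                          # --
```

Varying the marker shape ("o", "s", and so on) rather than only the colour is also what keeps such plots legible in black-and-white.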

Chapter 8 describes the scikit-learn machine learning library. The author is not to be blamed for the fact that the Python community has chosen to export (at least one version of) linear regression as part of the machine learning library, but I deeply regret that the author (admittedly following the scikit-learn documentation) has then approached regression via the training set/testing set methodology. The author then goes on to say, “However, a good indication of what prediction should be perfect is the variance,” and then looks at the variable linreg.score for the testing data only. We are presumably talking about the r² concept here, though the author does not mention this, or describe it other than saying “the more the variance is close to 1 the more the prediction is perfect.” There is also no discussion of R², which would be an excellent topic to mention, as it would introduce over-fitting, another topic not mentioned in this book. The discussion on support vector classifiers is marred by the fact that the description of regularization is meaningless.
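
The score in question is the coefficient of determination, which is what scikit-learn documents `LinearRegression.score` as returning. The sketch below is my own illustration, not the book's code; it uses NumPy polynomial fits on synthetic data, with an arbitrary seed and noise level, to compute the quantity from its definition and show how it relates to over-fitting.

```python
import numpy as np

def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_residual / SS_total."""
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - np.mean(y_true)) ** 2)
    return 1.0 - ss_res / ss_tot

# Synthetic data: a straight line plus noise (assumed for illustration).
rng = np.random.default_rng(0)
x = np.linspace(0.0, 1.0, 30)
y = 2.0 * x + 1.0 + rng.normal(scale=0.3, size=x.size)

# Alternate points go to the training and testing sets.
train = np.arange(0, 30, 2)
test = np.arange(1, 30, 2)

scores = {}
for degree in (1, 6):
    fit = np.poly1d(np.polyfit(x[train], y[train], degree))
    scores[degree] = (
        r_squared(y[train], fit(x[train])),  # score on the data used to fit
        r_squared(y[test], fit(x[test])),    # score on held-out data
    )

for degree, (train_r2, test_r2) in scores.items():
    print(degree, round(train_r2, 3), round(test_r2, 3))
```

On the training data the more flexible model can never score worse, because it contains the straight line as a special case. Whether it also does better on held-out data is exactly the over-fitting question, and looking at the testing score alone, as the book does, gives no way to see the gap.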

Chapter 9 presents an interesting example, testing the question, “Does distance from the sea affect temperature?” and getting weather data from openweathermap.org, while making good pragmatic choices on deducing “distance from the sea.” Chapter 11 is also a good example.

The book is regrettably poorly edited, for grammar, for presentation, and for technical consistency (and accuracy). The presentation was clearly aimed at a color book, and the colored dots in Figures 8-2, and so on, are practically indistinguishable in black-and-white, whereas different shapes could have been used. In Figure 8-10, I cannot spot the “blue dot in the red portion.”

Would I recommend this book? As a textbook, no. To someone who wants to learn how to draw valid inferences from data, no, as there is essentially no discussion of hypothesis testing, over-fitting, or the curse of dimensionality. To the new data analyst who complains, “At college, we were given the data and told to produce statistical conclusions; now I am told to find the data, or am given a mixture of Excel spreadsheets and web addresses, and have to produce lots of pretty pictures,” yes: here is a decent introduction to a powerful manipulation toolbox. But I would also suggest a good Python book.

Reviewer:  J. H. Davenport Review #: CR144549 (1609-0636)
Data Mining (H.2.8 ... )
Python (D.3.2 ... )