Computing Reviews

Python for probability, statistics, and machine learning (2nd ed.)
Unpingco J., Springer International Publishing, New York, NY, 2019. 384 pp. Type: Book
Date Reviewed: 10/15/20

The aim of this book is to offer programmers a tutorial on how to use Python libraries, like NumPy, Matplotlib, Pandas, SciPy, and SymPy, to perform probability evaluations and statistical analyses as the foundations for studying machine learning. It then gives a summary of various machine learning strategies and introduces some further libraries, such as TensorFlow, to support that work.

The book is divided into four parts. It starts with a brief introduction to some of the main mathematical/scientific Python libraries, the IPython shell, Jupyter notebooks, and various integrated development environments (IDEs). Next, it discusses random variables, various distributions, and sampling methods. The book then introduces some Python statistical modules and gives a tutorial on how to test hypotheses, evaluate confidence levels, and perform linear regressions in Python. Finally, it gives an overview of machine learning concepts such as decision trees, neural networks, dimensionality reduction through principal component analysis (PCA), deep learning, the general steps in building a machine learning model with simple neural networks, and ways to train and test models.

The book uses Python 3.6 for its examples. The author expects the reader to understand simple loops, lists, matrix operations, Python math operations, input/output operations, and the module import system. Because Python 3 has good documentation online, it is not very hard to find explanations for any of these features that one does not know. The book introduces the use of more advanced Python libraries with small blocks of code, brief comments, and sample output. Thus, the book is aimed primarily at intermediate or advanced Python programmers, although a beginner could struggle through it with some effort.

In terms of mathematics, because the author tries to give a brief background on every probability and statistics concept that is touched upon, the book is dense with formulas. Because this is not a general introduction to probability and statistics, there is limited space for examples of datasets and problem scenarios; therefore, the mathematics may not be easy to understand and apply if the reader is not already familiar with intermediate probability, statistics, and linear algebra.

On the other hand, the book clearly defines each variable involved. It also uses clear variable names, increasing readability. And the author usually clarifies what the result of a calculation means, a salutary practice. However, some notation could be improved. For example, in the illustration of the chi-square distribution, after calculating the z-score, the author writes 1-stats.chi2(2).cdf(z). Based on the previous example and information, readers can infer that by writing 1-stats.chi2(2).cdf(z) instead of stats.chi2(2).cdf(z), the code is calculating the tail bound of the chi-square distribution; but this could be clarified with a comment. Moreover, because the code passes 2 to stats.chi2, and there are three categories in the dataset, we might assume that 2 represents the degrees of freedom. But all of this would be easier for the reader to understand if the author had written something like this: “deg_of_freedom = num_categories - 1” and “1-stats.chi2(deg_of_freedom).cdf(z) # get the tail bound.”
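The suggestion above can be sketched as a short, self-contained script. This is an illustration only, not the book's code: the observed counts are hypothetical, the expected counts assume equally likely categories, and the statistic is named z simply to match the quoted expression. The call to sf() is added as an assumption about what a reader might prefer; it returns the same tail value as 1 - cdf().

```python
from scipy import stats

# Hypothetical counts for three categories (not taken from the book).
observed = [30, 50, 20]
# Assumed null hypothesis: all three categories are equally likely.
expected = [sum(observed) / len(observed)] * len(observed)

# Pearson chi-square statistic, named z to match the quoted expression.
z = sum((o - e) ** 2 / e for o, e in zip(observed, expected))

# Name the argument instead of passing a bare 2.
num_categories = len(observed)
deg_of_freedom = num_categories - 1

# Upper-tail probability: the chance of a statistic at least as large as z.
tail = 1 - stats.chi2(deg_of_freedom).cdf(z)

# The survival function gives the same quantity directly.
tail_sf = stats.chi2(deg_of_freedom).sf(z)

print(f"statistic = {z:.3f}, tail probability = {tail:.6f}")
```

Naming the degrees of freedom and commenting the subtraction makes both inferences explicit, so the reader no longer has to reconstruct them from the earlier example.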

In the machine learning section of the book, the author makes good use of visuals to explain machine learning concepts. The mathematical methods mentioned in this part are often not closely related to those from the previous chapters. Some general principles such as linear regression may apply to intermediate steps, but different libraries and methods are used. The beginner should read the (excellent) table of contents and directly study the methods needed; however, if the reader is expecting to learn how to integrate a machine learning study with various statistical methods, this book may not provide enough guidance.

This book needs better editing to avoid sentences like, “The pip installer does not check for such conflicts checks only if the proposed package already has its dependencies installed and will install them if not or remove existing incompatible modules” (p. 3). The author recommends dividing by “the fractional powers of 1/2” (p. 14) when he means “the powers of the fraction 1/2.” Later he says an increase in dimensionality would necessitate “1,000 more data points” instead of “1,000 times more data points” (p. 218). Such errors can seriously hamper a novice learner.

Nevertheless, this work is a generally sound and comprehensive overview of the areas it covers. We recommend it to Python programmers interested in growing in these areas and to experts in these areas interested in learning how to work with them in Python.

Reviewers: Eugene Callahan, Yujia Zhang
Review #: CR147083 (2103-0052)

Reproduction in whole or in part without permission is prohibited.   Copyright 2024 ComputingReviews.com™