Computing Reviews
Mining imperfect data: with examples in R and Python (2nd ed.)
Pearson R., SIAM, Philadelphia, PA, 2020. 184 pp.  Type: Book (978-1-611976-26-7)
Date Reviewed: Nov 19 2021

Data analysis, including both data mining and machine learning, has made considerable progress in the past decade, and both the R and Python programming languages are used to analyze data. The Achilles’ heel in this task is the handling of imperfect data. This second edition is a timely publication, covering all the important issues in data analysis.

The book’s ten chapters cover topics such as the characterization of imperfect data, univariate outliers, multivariate outliers, time series outliers, missing data, other data anomalies such as metadata errors and uninformative variables, sensitivity analysis, sampling schemes for a fixed dataset, and good data characterization. All the chapters are peppered with R and Python programs and illustrated with many examples from practical applications.

The introductory chapter sets the stage by listing the ten types of data imperfection considered in the book. This chapter also describes the sources of data imperfection and data exchange formats. The second chapter, on univariate outliers, describes four different outlier models and discusses outlier-resistant procedures. This chapter also includes outlier detection procedures, and case studies are presented.
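
To give a flavor of what an outlier-resistant detection procedure can look like, here is a minimal Python sketch of the Hampel identifier, which flags points that lie far from the median in units of the scaled median absolute deviation (MAD). The review does not say which procedures the book presents, so the choice of rule, the function name hampel_outliers, the threshold of 3, and the sample readings are all assumptions made for illustration.

```python
from statistics import median


def hampel_outliers(values, threshold=3.0):
    """Return indices of values more than `threshold` scaled MADs from the median.

    Illustrative helper only; not taken from the book under review.
    """
    center = median(values)
    # 1.4826 makes the MAD comparable to the standard deviation for Gaussian data.
    scaled_mad = 1.4826 * median(abs(v - center) for v in values)
    if scaled_mad == 0:
        # More than half of the values coincide, so the rule has no usable scale.
        return []
    return [i for i, v in enumerate(values) if abs(v - center) > threshold * scaled_mad]


# Hypothetical readings with one gross error at index 5.
readings = [10.1, 9.8, 10.3, 10.0, 9.9, 55.0, 10.2]
flagged = hampel_outliers(readings)
print(flagged)
```

Because the median and the MAD are barely moved by the single gross error, the rule still isolates it. A mean and standard deviation computed from the same readings would both be pulled toward the bad value.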

The third chapter, on multivariate outliers, starts with multivariate statistics, correlations, and covariance, including classical covariance and Mahalanobis distances. Robust estimation procedures are discussed next. Distance and density-based procedures are also expounded. As usual, a case study is presented in detail along with programs and graphs. Chapter 4 is on dealing with time series outliers. Four sample problems are presented first, to give a context, followed by a discussion of the nature of time series outliers. The chapter covers appropriate models and different filters for cleaning, and applications are worked out in detail.
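
For readers unfamiliar with the Mahalanobis distance mentioned above, the following Python sketch computes it for two-dimensional points using the classical sample mean and covariance. It is a hedged illustration, not code from the book: the function name mahalanobis_2d and the six sample points are assumptions, and the 2x2 covariance matrix is inverted by hand to keep the example free of third-party libraries.

```python
from math import sqrt


def mahalanobis_2d(points):
    """Mahalanobis distance of each (x, y) point from the sample mean.

    Uses the classical (non-robust) sample covariance. Illustrative only.
    """
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    # Entries of the classical sample covariance matrix S.
    sxx = sum((x - mean_x) ** 2 for x, _ in points) / (n - 1)
    syy = sum((y - mean_y) ** 2 for _, y in points) / (n - 1)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points) / (n - 1)
    det = sxx * syy - sxy ** 2
    if det <= 0:
        raise ValueError("covariance matrix is singular; points may be collinear")
    distances = []
    for x, y in points:
        dx, dy = x - mean_x, y - mean_y
        # d^2 = [dx dy] S^{-1} [dx dy]^T, with S^{-1} = (1/det) [[syy, -sxy], [-sxy, sxx]]
        d2 = (syy * dx * dx - 2 * sxy * dx * dy + sxx * dy * dy) / det
        distances.append(sqrt(d2))
    return distances


# Hypothetical data: five points on the line y = x, plus one point that breaks the pattern.
points = [(0, 0), (1, 1), (2, 2), (3, 3), (4, 4), (0, 4)]
distances = mahalanobis_2d(points)
most_extreme = max(range(len(points)), key=lambda i: distances[i])
print(most_extreme, [round(d, 3) for d in distances])
```

The off-pattern point (0, 4) receives the largest distance even though neither of its coordinates is unusual on its own, which is the point of a multivariate measure. The mean and covariance used here are themselves distorted by that point, which is one motivation for the robust estimation procedures the chapter discusses.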

The fifth chapter discusses how to deal with missing data. After a discussion of missing data representations, two missing data examples are presented. Missing data sources, types, and patterns are explained, as well as simple treatment strategies. The EM algorithm is explained and examples are provided. Chapter 6 describes other data anomalies: inliers, misaligned data, thin levels of categorical data, metadata errors, data omissions, and duplicate records.

Chapter 7’s discussion of sensitivity analysis includes the general framework along with some specific recommendations. Chapter 8 gives a detailed description of different sampling strategies, with their advantages and disadvantages, and includes a number of examples for each. Chapter 9 characterizes “good” data using inequalities. Some of these can conflict with one another, and the author recommends how to strike a balance among them. The concluding chapter covers what is new in this second edition.

I enjoyed reading this book and I am confident that practitioners will find it very useful. The only drawback is the lack of exercises for students, who may want to check whether they have understood the material.

Reviewer:  M. S. Krishnamoorthy Review #: CR147385
  Reviewer Selected
Python (D.3.2 ... )
Data Mining (H.2.8 ... )
Other reviews under "Python": Date
 Natural language processing recipes: unlocking text data with machine learning and deep learning using Python
Kulkarni A., Shivananda A.,  Apress, New York, NY, 2019. 260 pp. Type: Book (978-1-484242-66-7)
Nov 26 2021
 Practical natural language processing with Python: with case studies from industries using text data at scale
Sri M.,  Apress, New York, NY, 2021. 272 pp. Type: Book (978-1-484262-45-0)
Nov 9 2021
Learning scientific programming with Python (2nd ed.)
Hill C.,  Cambridge University Press, Cambridge, UK, 2020. 570 pp. Type: Book (978-1-108745-91-8)
Nov 2 2021

Reproduction in whole or in part without permission is prohibited.   Copyright © 2000-2021 ThinkLoud, Inc.