Computing Reviews, the leading online review service for computing literature.

Mining imperfect data: with examples in R and Python (2nd ed.)
Pearson R., SIAM, Philadelphia, PA, 2020. 184 pp. Type: Book (978-1-611976-26-7)

Date Reviewed: Nov 19 2021

Data analysis, including both data mining and machine learning, has made a lot of progress in the past decade, and both the R and Python programming languages are widely used for it. The Achilles’ heel in this task is the handling of imperfect data. This second edition is a timely publication, covering all the important issues in data analysis. The book’s ten chapters cover topics such as the characterization of imperfect data, univariate outliers, multivariate outliers, time series outliers, missing data, other data anomalies such as metadata errors and uninformative variables, sensitivity analysis, sampling schemes for a fixed dataset, and good data characterization. All the chapters are peppered with R and Python programs and illustrated with many examples from practical applications.

The introductory chapter sets the stage by listing the ten types of data imperfection considered in the book. It also describes the sources of data imperfection and data exchange formats. The second chapter, on univariate outliers, describes four different outlier models and discusses outlier-resistant procedures and outlier detection procedures; case studies are presented. The third chapter, on multivariate outliers, starts with multivariate statistics, correlations, and covariance, including classical covariance and Mahalanobis distances. Robust estimation procedures are discussed next, followed by distance- and density-based procedures. As usual, a case study is presented in detail along with programs and graphs.

Chapter 4 deals with time series outliers. Four sample problems are presented first to give context, followed by a discussion of the nature of time series outliers. The chapter covers appropriate models and different filters for cleaning, and applications are worked out in detail. The fifth chapter discusses how to deal with missing data. After a discussion of missing data representations, two missing data examples are presented. Missing data sources, types, and patterns are explained, as well as simple treatment strategies. The EM algorithm is explained and examples are provided. Chapter 6 describes other data anomalies: inliers, misaligned data, thin levels of categorical data, metadata errors, data omissions, and duplicate records.

Chapter 7’s discussion of sensitivity analysis includes the general framework along with some specific recommendations. Chapter 8 gives a detailed description of different sampling strategies, with their advantages and disadvantages, and includes a number of examples. Chapter 9 characterizes “good” data using inequalities. Some of these inequalities can conflict with one another, and the author recommends how to strike a balance among them. The concluding chapter covers what is new in this second edition.

I enjoyed reading this book and I am confident that practitioners will find it very useful. The only drawback is the lack of exercises for students, who may want to check whether they have understood the material.
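To give a flavor of what an outlier-resistant procedure for univariate data can look like, here is a minimal Python sketch. It is not taken from the book; it is a generic illustration that contrasts the classical three-sigma rule with a median/MAD rule (often called the Hampel identifier). The function names, the threshold `t = 3.0`, and the sample data are all hypothetical choices made for this example.

```python
import statistics


def three_sigma_outliers(values, t=3.0):
    """Indices of points more than t standard deviations from the mean."""
    mean = statistics.fmean(values)
    sd = statistics.stdev(values)
    return [i for i, x in enumerate(values) if abs(x - mean) > t * sd]


def hampel_outliers(values, t=3.0):
    """Indices of points more than t scaled MADs from the median.

    The factor 1.4826 makes the MAD comparable to the standard
    deviation for Gaussian data. If more than half of the values are
    identical, the MAD is zero and every other point is flagged.
    """
    med = statistics.median(values)
    mad = statistics.median([abs(x - med) for x in values])
    scale = 1.4826 * mad
    return [i for i, x in enumerate(values) if abs(x - med) > t * scale]


# Hypothetical measurements: eight readings near 10 and two gross errors.
data = [9.8, 10.1, 10.0, 9.9, 10.2, 10.0, 9.7, 10.3, 55.0, 60.0]

print(three_sigma_outliers(data))  # mean and sd are inflated by the errors
print(hampel_outliers(data))       # median and MAD are barely affected
```

On this sample the two gross errors inflate the mean and standard deviation so much that the three-sigma rule flags nothing, while the median-based rule flags both points. This masking effect is one reason resistant estimators are generally preferred for outlier detection.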

Reviewer: M. S. Krishnamoorthy	Review #: CR147385

Python (D.3.2 ... )

Data Mining (H.2.8 ... )

Reproduction in whole or in part without permission is prohibited. Copyright 1999-2024 ThinkLoud®