Data analysis, including both data mining and machine learning, has made considerable progress in the past decade. Both the R and Python programming languages are widely used to analyze data. The Achilles’ heel of this task is the handling of imperfect data. This second edition is a timely publication, covering all the important issues in data analysis.
The book’s ten chapters cover topics such as the characterization of imperfect data, univariate outliers, multivariate outliers, time series outliers, missing data, other data anomalies such as metadata errors, uninformative variables, sensitivity analysis, sampling schemes for a fixed dataset, and good data characterization. All the chapters are peppered with R and Python programs and illustrated with many examples from practical applications.
The introductory chapter sets the stage by listing the ten types of data imperfection considered in the book. This chapter also describes the sources of data imperfection, as well as data exchange formats. The second chapter, on univariate outliers, describes four different outlier models and discusses outlier-resistant procedures. It also covers outlier detection procedures and presents case studies.
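To give a flavor of why outlier-resistant procedures matter for univariate data, here is a minimal Python sketch. It is not taken from the book: the sample values, the threshold of 3, and the function names are assumptions chosen for illustration. It compares the classical three-sigma rule with a rule based on the median and the median absolute deviation (MAD), scaled by the usual constant 1.4826 so that it is comparable to a standard deviation for Gaussian data.

```python
import statistics


def three_sigma_outliers(values, t=3.0):
    """Flag points more than t standard deviations from the mean."""
    mean = statistics.fmean(values)
    sd = statistics.stdev(values)
    return [x for x in values if abs(x - mean) > t * sd]


def mad_outliers(values, t=3.0):
    """Flag points more than t scaled MADs from the median."""
    med = statistics.median(values)
    mad = statistics.median(abs(x - med) for x in values)
    scale = 1.4826 * mad  # makes the MAD comparable to a standard deviation
    if scale == 0:
        return []  # more than half the values are identical; rule is undefined
    return [x for x in values if abs(x - med) > t * scale]


# Hypothetical measurements with one gross error (50.0).
data = [9.8, 10.1, 10.0, 9.9, 10.2, 10.0, 9.7, 10.3, 50.0, 10.1]

classical = three_sigma_outliers(data)
resistant = mad_outliers(data)
print("three-sigma rule:", classical)
print("MAD-based rule:  ", resistant)
```

On this sample the single bad value inflates the mean and standard deviation enough that the three-sigma rule flags nothing, while the median-based rule isolates it.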
The third chapter, on multivariate outliers, starts with multivariate statistics, correlations, and covariance, including classical covariance and Mahalanobis distances. Robust estimation procedures are discussed next. Distance- and density-based procedures are also expounded. As usual, a case study is presented in detail along with programs and graphs. Chapter 4 is on dealing with time series outliers. Four sample problems are presented first to provide context, followed by a discussion of the nature of time series outliers. The chapter covers appropriate models and different filters for cleaning, and applications are worked out in detail.
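For readers unfamiliar with the classical Mahalanobis distance mentioned in connection with Chapter 3, the following Python sketch may help. It is an illustration under stated assumptions, not the book's code: the data points and the function name are hypothetical, and the calculation is restricted to two variables so that the covariance matrix can be inverted by hand. Robust variants replace the mean and covariance used here with outlier-resistant estimates.

```python
import math
import statistics


def mahalanobis_2d(points):
    """Classical Mahalanobis distance of each 2-D point from the sample mean."""
    xs = [p[0] for p in points]
    ys = [p[1] for p in points]
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    n = len(points)
    # Entries of the sample covariance matrix S (divisor n - 1).
    sxx = sum((x - mx) ** 2 for x in xs) / (n - 1)
    syy = sum((y - my) ** 2 for y in ys) / (n - 1)
    sxy = sum((x - mx) * (y - my) for x, y in points) / (n - 1)
    det = sxx * syy - sxy ** 2
    if det <= 0:
        raise ValueError("covariance matrix is singular")
    result = []
    for x, y in points:
        dx, dy = x - mx, y - my
        # d^2 = [dx dy] S^{-1} [dx dy]^T, with the 2x2 inverse written out.
        d2 = (syy * dx * dx - 2 * sxy * dx * dy + sxx * dy * dy) / det
        result.append(math.sqrt(max(d2, 0.0)))
    return result


# Hypothetical data: seven points close to the line y = x, plus a final point
# (2, 6) whose coordinates are each unremarkable but which breaks the pattern.
points = [(1, 1.0), (2, 2.1), (3, 2.9), (4, 4.2),
          (5, 4.8), (6, 6.1), (7, 7.0), (2, 6.0)]

distances = mahalanobis_2d(points)
for p, d in zip(points, distances):
    print(p, round(d, 2))
```

The last point has the largest distance even though neither of its coordinates is extreme on its own, which is the situation a univariate check cannot detect.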
The fifth chapter discusses how to deal with missing data. After a discussion of missing data representations, two missing data examples are presented. Missing data sources, types, and patterns are explained, as well as simple treatment strategies. The expectation-maximization (EM) algorithm is explained and examples are provided. Chapter 6 describes other data anomalies: inliers, misaligned data, thin levels of categorical data, metadata errors, data omissions, and duplicate records.
Chapter 7’s discussion of sensitivity analysis includes the general framework along with some specific recommendations. Chapter 8 gives a detailed description of different sampling strategies, along with their advantages and disadvantages, and includes a number of examples for each. Chapter 9 characterizes “good” data using inequalities. Some of these may conflict with one another, and the author recommends how to strike a balance among them. The concluding chapter covers what is new in this second edition.
I enjoyed reading this book and I am confident that practitioners will find it very useful. The only drawback is the lack of exercises for students, who may want to check whether they have understood the material.