An outlier is “an observation which deviates so much from other observations as to arouse suspicions that it was generated by a different mechanism”. In data mining, outliers are usually called anomalies, although they are also referred to as abnormalities, discordants, or deviants in statistics. Their detection is especially difficult since they are often hard to distinguish from noise. In fact, some authors classify outliers into weak and strong outliers precisely to stress the difference between noise (the former) and anomalies (the latter).
While noise identification and removal is an important problem in many application domains, such as signal processing, data mining efforts are typically focused on anomaly detection. Since the distinction between the two is merely semantic, many of the techniques proposed for one problem can be used for solving the other. Approaching the problem from different perspectives has led to a wide gamut of techniques, which do not always employ consistent terminology because they do not share the same background. Fortunately, Aggarwal has written a thorough monograph that surveys the broad field of outlier analysis (or anomaly detection, if you prefer). His volume seamlessly integrates the traditional content from statistics textbooks and the latest developments in data mining, providing a balanced review of the many algorithms that have been proposed in the literature.
After the usual introductory chapter, which provides a bird’s-eye view of the field, the first half of the book covers the different techniques and models that have been devised for the detection of outliers. Starting with extreme value analysis, the branch of statistics that deals with extreme deviations from the median of probability distributions, the author delves into probabilistic models whose parameters can be learned by expectation maximization (EM) algorithms. Linear models, which analyze linear correlations, are then addressed, including linear regression and principal component analysis (PCA). Later, proximity-based outlier detection techniques are analyzed. These encompass both distance-based methods, such as those based on nearest neighbors, and density-based methods, whose origin can be traced to the density-based clustering techniques often used in data mining. In fact, outliers can be viewed as the byproduct of unsupervised clustering techniques: anomalies (or noise) are what is not included in the identified clusters. Subspace clustering is another common data mining technique, designed for dealing with high-dimensional data, yet anomaly detection poses some specific challenges and requires something more than the blind adaptation of existing clustering techniques. Of course, Aggarwal delves into all the necessary details and describes some subspace outlier detection techniques. His in-depth survey of outlier detection techniques ends with a chapter on supervised outlier detection. From the perspective of supervised machine learning, outlier detection is just a classification problem, yet a highly unbalanced one with its own nuances in practice.
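To give a flavor of the distance-based methods mentioned above, here is a minimal sketch that scores each point by the distance to its k-th nearest neighbor, so that isolated points receive high scores. It is not taken from the book: the function name `knn_outlier_scores`, the choice of `k`, and the toy data are assumptions made purely for illustration, and the brute-force search is only practical for small data sets.

```python
import math


def knn_outlier_scores(points, k=2):
    """Return, for each point, the Euclidean distance to its k-th nearest neighbor.

    Illustrative brute-force version (quadratic in the number of points).
    Larger scores suggest more isolated, and hence more suspicious, points.
    """
    if not 1 <= k < len(points):
        raise ValueError("k must be between 1 and len(points) - 1")
    scores = []
    for i, p in enumerate(points):
        # Distances from p to every other point, in ascending order.
        dists = sorted(math.dist(p, q) for j, q in enumerate(points) if j != i)
        scores.append(dists[k - 1])
    return scores


# Hypothetical toy data: a tight group of four points plus one distant point.
data = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (0.1, 0.1), (5.0, 5.0)]
scores = knn_outlier_scores(data, k=2)

# Index of the point with the highest score, i.e. the strongest outlier candidate.
most_anomalous = max(range(len(data)), key=scores.__getitem__)
```

In this toy example the distant point gets by far the largest score. Turning scores into decisions still requires a threshold or a ranking cutoff, which is where the distinction between noise and true anomalies becomes a judgment call.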
The survey of outlier detection models, described in terms of multidimensional numerical data in the first half of the book, is complemented by a review of anomaly detection techniques for different data types in the second half. Aggarwal devotes chapters to categorical, text, and mixed attribute data, as well as to situations where data values have dependencies and hence cannot be treated independently from one another. These situations range from time series and data streams, through spatial and spatiotemporal data, to discrete sequences, graphs, and networks. Specific techniques for each of them are treated, and a wealth of references is provided in the author’s insightful bibliographic comments at the end of each chapter.
With the thoroughness that characterizes its surveys of techniques and adaptations to different data types, the book ends with a chapter on the applications of outlier analysis. Many application domains are mentioned, from quality control and fault detection to fraud detection and intrusion detection systems. The author provides a brief description of specific problems in each application domain, with discussions of how the different techniques covered in the previous chapters can be used to solve them in practice. His abundant bibliographic references can serve as a good starting point for those interested in particular applications.
Aggarwal has written a complete survey of the state of the art in anomaly detection. His writing style is not as dry as you might expect from a thorough academic survey of a similar scope, the details behind different outlier detection techniques are clearly explained, and his comments when comparing and contrasting different approaches are often insightful. His book provides a solid frame of reference for those interested in anomaly detection, both researchers and practitioners, whether they are generalists or mostly focused on particular applications. All of them can benefit from the broad overview of the field, the accessible introductions to many different techniques, and the annotated pointers for further reading that this book provides.