Data stream mining has become an important field, driven by applications ranging from patient tracking to stock market investing. Data streams contain data items in temporal order and are potentially endless. Because mining takes place online, as the data arrive, it must be both efficient and effective. This paper is a survey of preprocessing for data stream mining.
The authors first present fundamental concepts related to data streams, such as concept drift. They emphasize the principles of proper experiment design in this machine learning domain. Then they present the important preprocessing aspects of data stream mining, namely data reduction through dimensionality reduction (for example, eliminating redundant features), instance reduction (reducing the number of training instances), and feature space simplification (for example, discretization). They analyze the leading algorithms in terms of predictive performance, reduction rate, running time, and memory use. The experiments use 20 datasets, 13 of which are real.
The survey is comprehensive, and its pointers to future research are good. The study is well done and will be useful to both practitioners and researchers. It is a notable addition to the literature of a research area that is still in its infancy.