This book is a comprehensive collection of data preprocessing techniques used in data mining. Readers who practice data mining will find it beneficial, as it provides detailed descriptions of techniques ranging from handling missing values and noisy data, to data reduction and discretization, to feature selection and instance selection.
Data mining plays an increasingly important role in today’s research, business, and society. Because the amount of available data is skyrocketing and we are all eager to figure out what it means, data mining has become a critical tool for managing and using that data. However, data, especially data collected from real applications, is often incomplete, inaccurately represented, and unsuitable for direct use by a data mining process. This book surveys the data preprocessing methods that prepare raw data for use by various data mining processes.
The book contains ten chapters. Chapter 1 introduces the concept of data preprocessing and its relation to data mining. Chapter 2 describes dataset properties and the statistical tests appropriate for them. Chapter 3 establishes the basic models of data preparation, including integration, cleaning, and transformation. Chapters 4 through 9 are devoted to specific data preprocessing techniques: dealing with missing values, dealing with noisy data, data reduction, feature selection, instance selection, and discretization. The last chapter is an overview of knowledge extraction based on evolutionary learning (KEEL), a software package that is widely used in data mining and offers rich data preprocessing features.
Each chapter in the book, especially those discussing specific areas of data preprocessing, is an independent module. Each one starts with an abstract and an introduction to the concepts, followed by a detailed description of the data preprocessing technique and the mathematical tools it requires. Each chapter ends with a comprehensive list of references, ranging from about 30 to over 170 entries, which gives readers an excellent starting point if they would like to pursue the topics further.
This book is an excellent guide to data preprocessing for data mining. It is suitable for both practitioners and researchers who work with datasets in their data mining projects.