Computing Reviews, the leading online review service for computing literature.

Predictive analytics and data mining: concepts and practice with RapidMiner
Kotu V., Deshpande B., Morgan Kaufmann Publishers Inc., San Francisco, CA, 2014. 446 pp. Type: Book (978-0-128014-60-8)

Date Reviewed: Nov 30 2015

As organizations are flooded with data, data analytics has become essential for gaining the insights that support data-driven decision making. The popularity and importance of data analytics and data mining are reflected in the establishment of data science as a separate discipline in many universities. Data science equips students with interdisciplinary knowledge and skills drawn from statistics, mathematics, computer science, and the application domain, so that they can manage large-scale raw data and mine useful, actionable knowledge from it. At its heart lies data analytics.
The authors of the book explore different predictive analytics methodologies and show how to implement them using the RapidMiner tool. The book covers algorithms for classification, regression, association analysis, clustering, time series forecasting, anomaly detection, and so on. It also covers data exploration techniques and feature selection approaches that cut across many predictive analytics approaches.
The authors organize the presentation of each topic in three steps. First, they present each analytics method with an intuitive introduction and advice on when to use it. Second, they explain the essential details of how the method works, demonstrating it on an example data set. Finally, they show how the method can be implemented in RapidMiner, followed by a short discussion of the output and how to interpret it.
Books on a specific analytics tool or system often read like a tutorial or manual, with little on the methodology or theory behind the tool, while other books go to the other extreme and focus mainly on theory. The authors have done an excellent job of balancing theory and application; thus, the book can serve as a great introduction for newcomers to data analytics and as a good review and introduction to RapidMiner for analytics professionals.
The chapter on classifiers covers decision trees, rule induction, k-nearest neighbors (k-NN), naïve Bayes, artificial neural networks, support vector machines, and ensemble learners. The authors explain the necessary concepts, such as entropy, information gain, gain ratio, different similarity measures, proximity between records, prior, posterior, and conditional probabilities, hidden layers and backpropagation of error, boundary fitting and margin, meta-learning using voting, bagging (bootstrap aggregating), boosting, and random forest.
The chapter on regression methods focuses on the linear regression model for numeric variables and the logistic regression model for categorical predictions.
The chapter on association analysis explains how to mine association rules from frequently co-occurring item sets, using the concepts of support, confidence, lift, and conviction. Two efficient algorithms, the Apriori algorithm and the frequent pattern growth (FP-Growth) algorithm, are illustrated.
Clustering analysis is used to describe a given data set, for example to find customer groups for market segmentation, to cluster documents by topic, or to group clickstream patterns. It is also used as a preprocessing step for other predictive analytics methods, for example to reduce dimensionality or the number of objects. The chapter covers k-means clustering, which uses centroids and distance measures; density-based spatial clustering of applications with noise (DBSCAN), which uses the density of the spatial distribution; and self-organizing maps, which generate a two-dimensional grid with similar objects placed next to each other.
The chapters on text mining, time series forecasting, and anomaly detection apply the classification and clustering methods to specific data types. The text mining chapter explains how to transform text into more structured data and then convert it into a matrix representation using term frequency-inverse document frequency (TF-IDF), so that classification or clustering algorithms can be applied.
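To make the TF-IDF matrix representation mentioned above more concrete, here is a minimal sketch in plain Python. It is not taken from the book and does not reproduce RapidMiner's operators. It assumes one common weighting variant (term frequency normalized by document length, multiplied by the natural log of the number of documents divided by the document frequency, with no smoothing) and whitespace tokenization. The function name `tf_idf` is hypothetical.

```python
import math
from collections import Counter


def tf_idf(documents):
    """Return (vocabulary, matrix) for a list of text documents.

    Assumed weighting (one of several common variants):
      tf(t, d) = count of t in d / number of tokens in d
      idf(t)   = ln(number of documents / number of documents containing t)
    """
    # Naive tokenization: lowercase and split on whitespace (an assumption).
    tokenized = [doc.lower().split() for doc in documents]
    vocabulary = sorted({term for tokens in tokenized for term in tokens})
    n_docs = len(tokenized)

    # Document frequency: in how many documents each term appears.
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))

    matrix = []
    for tokens in tokenized:
        counts = Counter(tokens)
        total = len(tokens)
        row = [
            (counts[term] / total) * math.log(n_docs / doc_freq[term])
            if total else 0.0
            for term in vocabulary
        ]
        matrix.append(row)
    return vocabulary, matrix


if __name__ == "__main__":
    docs = [
        "data mining finds patterns",
        "text mining uses text data",
        "clustering groups similar documents",
    ]
    vocab, rows = tf_idf(docs)
    for doc, row in zip(docs, rows):
        weights = {t: round(w, 3) for t, w in zip(vocab, row) if w > 0}
        print(doc, "->", weights)
```

Each row of the resulting matrix is a numeric vector for one document, which is the form that classification or clustering algorithms expect. Under this variant, a term that appears in every document gets a weight of zero, because it does not help distinguish one document from another.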
The time series forecasting methods use historical data on one variable to predict the next value of the same variable. The prediction can be made by data-driven approaches that capture local patterns, or by model-driven forecasting methods, such as linear regression or autoregression, that capture global patterns. Anomaly detection is used to identify outliers in data, using statistical distributions, distance-based clustering, or density-based k-NN classification.
The chapter on data exploration covers how to prepare data for data mining through descriptive statistics and data visualization. The chapter on feature selection discusses principal component analysis, information gain-based feature selection for numerical data, chi-square-based filtering for categorical data, and regression methods that select or eliminate features one at a time. The chapter on model evaluation presents different performance measures for machine learning algorithms and predictions, including precision, recall, accuracy, receiver operating characteristic (ROC) curves, area under the curve (AUC), and lift curves.
RapidMiner uses graphical process design for data analytics, which only requires users to specify parameters for the different machine learning algorithms. Professionals who need to learn machine learning methods quickly and get fast results will be happy with this book.
In summary, this book is an excellent introductory data science textbook that exposes students to the essential concepts in predictive analytics. For the seasoned professional, it can serve as a handy reference for choosing the best predictive analytics tool for a given data set.

Reviewer: Soon Ae Chun	Review #: CR143976 (1602-0098)

Data Mining (H.2.8 ... )

Content Analysis And Indexing (H.3.1 )

Reference (A.2 )

Reproduction in whole or in part without permission is prohibited. Copyright 1999-2024 ThinkLoud®