In this information age, the volume and complexity of data generated and collected are increasing exponentially. To make meaningful use of these huge volumes of complex data, we need fast, automated techniques for discovering useful knowledge in them.
Data mining is one of the most popular and effective tools for knowledge discovery, but current data mining techniques are neither fast nor efficient enough to be of significant use on high volumes of complex data. Nor can they discover knowledge in real time, that is, as the information is generated and gathered. Modern knowledge discovery applications should be able to mine knowledge on the fly in order to be useful in complex domains such as medical diagnosis, weather prediction, web analytics, and satellite image processing, where real-time information can dramatically improve the quality of decision making.
One of the most important steps in data mining involves clustering, the process of finding groups of data objects that are similar to one another. Clustering is useful for exploring data, and it can help to identify homogeneous data groups. It can also be used to detect anomalies or outliers that do not fit well into any cluster.
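To make the idea concrete, here is a minimal sketch of clustering with outlier detection. It is a generic illustration and not one of the algorithms from the book: it simply links points that lie within a distance threshold of one another and treats very small groups as outliers. The function name `cluster_points`, the parameters `eps` and `min_size`, and the sample points are all assumptions made for this example.

```python
from math import dist  # Euclidean distance, available in Python 3.8+


def cluster_points(points, eps, min_size=2):
    """Group points connected by chains of neighbors within eps of each other.

    Returns (clusters, outliers) as lists of point indices. Groups with
    fewer than min_size members are reported as outliers. Illustrative
    only; the cost is quadratic in the number of points.
    """
    unvisited = set(range(len(points)))
    clusters, outliers = [], []
    while unvisited:
        seed = min(unvisited)  # lowest remaining index, for a stable order
        unvisited.remove(seed)
        group, frontier = [seed], [seed]
        while frontier:
            current = frontier.pop()
            # Collect every unvisited point close enough to the current one.
            near = [j for j in unvisited
                    if dist(points[current], points[j]) <= eps]
            for j in near:
                unvisited.remove(j)
            group.extend(near)
            frontier.extend(near)
        members = sorted(group)
        if len(members) >= min_size:
            clusters.append(members)
        else:
            outliers.extend(members)
    return clusters, outliers


# Hypothetical sample data: two tight groups and one isolated point.
points = [(0.0, 0.0), (0.1, 0.2), (0.2, 0.1),
          (5.0, 5.0), (5.1, 4.9),
          (9.0, 0.5)]
clusters, outliers = cluster_points(points, eps=0.5)
print(clusters)  # [[0, 1, 2], [3, 4]]
print(outliers)  # [5]
```

Because every point is compared against every other remaining point, a naive approach like this one does not scale to large, high-dimensional datasets, which is the kind of limitation the book sets out to address.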
In this book, the authors describe three fast and scalable algorithms they developed for the clustering of high-volume and high-dimensional data: Halite, BoW (best of both worlds), and QMAS (querying, mining, and summarization of multimodal databases). The book first discusses and compares the existing clustering algorithms and their limitations when dealing with high-dimensional data.
Halite is a fast and scalable density-based clustering algorithm for data with moderate to high dimensionality, which can analyze large collections of complex data elements. The BoW algorithm focuses on the problem of finding clusters in terabytes of moderate- to high-dimensional data. It uses parallel clustering and sample-and-ignore (SnI) techniques to balance the costs of disk access and network access, the two potential bottlenecks. QMAS is a fast and scalable approach to two tasks: low-labor labeling (finding suitable labels for a large image collection in which only a few images are labeled), and mining and attention routing (finding clusters, outlier images, and representative images).
These techniques are much faster than existing ones when dealing with high volumes of complex data. While most other techniques focus on one aspect, either data size or complexity, the proposed techniques take both factors into account, thereby enabling real-time data mining on large and complex sets of data.
The book covers the methodology, working principles, algorithms, implementation and performance details, experimental results, and advantages of the three techniques in detail, using real-world examples. Figures, graphs, and charts clearly illustrate the processes and results. The authors share the results of real-world evaluations of their techniques on very large datasets with billions of complex data elements. These evaluations show that the three techniques consistently deliver accurate results much faster than existing techniques. The applications used for evaluation include automatic breast cancer diagnosis, satellite imagery analysis, and graph mining, the last on both a large web graph crawled by Yahoo! and a graph of all Twitter users and their connections.
This book is a must-read for all data mining professionals, as it explains new and superior techniques for clustering large datasets of high-dimensional data. It should also interest professionals who work with large volumes of complex data and want real-time information for better decision making.