High-dimensional datasets suffer from an inherent problem popularly known as the “curse of dimensionality”: clustering algorithms perform poorly as the number of attributes grows. In addition, when the dataset is large, computing the clusters becomes computationally expensive.
In this paper, the authors attempt to solve both of these problems using the k-nearest neighbors clustering algorithm. They propose a memory-efficient data structure (the KD tree) within the MapReduce framework. MapReduce is convenient in many ways: as the data and the number of computing nodes scale, managing that growth requires almost no effort.
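The parallel scheme the review describes can be illustrated with a toy MapReduce-style k-NN: each partition (computing node) maps to its local top-k candidates, and a reduce step merges them into a global top-k. This is a minimal sketch of the general pattern, not the authors' implementation; the function and variable names are illustrative.

```python
import heapq

def knn_mapreduce(partitions, query, k):
    """Toy MapReduce-style k-nearest neighbors (illustrative sketch).
    Map: each partition computes its local k closest points.
    Reduce: merge per-partition candidates into a global top-k."""

    def dist2(p):
        # Squared Euclidean distance to the query point.
        return sum((a - b) ** 2 for a, b in zip(p, query))

    # Map step: local top-k per partition (runs on each node in parallel).
    local_topk = (heapq.nsmallest(k, part, key=dist2) for part in partitions)

    # Reduce step: merge all candidate lists into the global top-k.
    merged = [p for cand in local_topk for p in cand]
    return heapq.nsmallest(k, merged, key=dist2)

# Example: three partitions, as if the data were split across three nodes.
parts = [[(0, 0), (5, 5)], [(1, 1), (9, 9)], [(2, 2), (7, 7)]]
print(knn_mapreduce(parts, (0, 0), 2))  # → [(0, 0), (1, 1)]
```

Because the reduce step only sees k candidates per partition rather than every point, the merge stays cheap even as partitions multiply.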
The KD tree is constructed as a balanced tree (using singular value decomposition (SVD) to choose the splits), and only the leaf nodes store data points. This keeps the tree memory efficient even for large datasets, so duplicating the structure across different computing nodes is feasible, enabling parallel computation. The k-nearest neighbors algorithm is then executed in parallel across these nodes.
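A leaf-storing balanced KD tree of this kind can be sketched as follows. Internal nodes keep only a split axis and value; all data points live in the leaves. The paper picks split directions via SVD; to keep this sketch dependency-free, it instead splits on the coordinate axis with the largest spread, a common stand-in. All names are illustrative, not from the paper.

```python
def build_tree(points, leaf_size=2):
    """Balanced KD-tree sketch: internal nodes hold only (axis, value);
    data points are stored exclusively in the leaves."""
    if len(points) <= leaf_size:
        return {"leaf": True, "points": list(points)}
    dims = len(points[0])
    # Axis with the widest spread (stand-in for the SVD principal direction).
    axis = max(range(dims),
               key=lambda d: max(p[d] for p in points) - min(p[d] for p in points))
    pts = sorted(points, key=lambda p: p[axis])
    mid = len(pts) // 2  # median split keeps the tree balanced
    return {"leaf": False, "axis": axis, "value": pts[mid][axis],
            "left": build_tree(pts[:mid], leaf_size),
            "right": build_tree(pts[mid:], leaf_size)}

def leaf_points(node):
    """Collect all stored points; they exist only in the leaves."""
    if node["leaf"]:
        return list(node["points"])
    return leaf_points(node["left"]) + leaf_points(node["right"])

pts = [(0, 0), (1, 5), (2, 1), (3, 4), (4, 2), (5, 3)]
tree = build_tree(pts)
print(sorted(leaf_points(tree)) == sorted(pts))  # → True
```

Because internal nodes carry no data, the serialized tree is small, which is what makes replicating it to every computing node cheap.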
Using the distributed index for matching multimedia objects (DIMO) system, the authors demonstrate a 20 percent improvement in precision over existing systems in the literature.