This paper proposes an approach called the correlation matrix kNN (CM-kNN) to improve the performance of the k-nearest neighbors (kNN) approach and its variants in three data mining applications: classification, regression, and missing data imputation. The paper is well organized and easy to follow. The motivation is clear, and the proposed approach has a sound mathematical derivation.
The standard kNN approach uses the same value of k for all test data points, which often limits prediction accuracy in classification applications. In contrast, CM-kNN uses a different value of k for each test data point: it reconstructs each test data point from the training data points, thereby learning a correlation matrix between the training and test data points, and applies several regularization techniques to improve accuracy.
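To make the contrast concrete, the baseline being improved upon can be sketched as follows: a minimal standard kNN classifier in which the same fixed k is applied to every test point (function and parameter names here are illustrative, not from the paper).

```python
import numpy as np

def knn_predict(X_train, y_train, X_test, k=5):
    """Standard kNN classifier: the same fixed k for every test point."""
    # Squared Euclidean distances, shape (n_test, n_train).
    d2 = (np.sum(X_test**2, axis=1)[:, None]
          + np.sum(X_train**2, axis=1)[None, :]
          - 2 * X_test @ X_train.T)
    idx = np.argsort(d2, axis=1)[:, :k]  # indices of the k nearest neighbors
    preds = []
    for row in idx:
        # Majority vote among the k neighbors' labels.
        vals, counts = np.unique(y_train[row], return_counts=True)
        preds.append(vals[np.argmax(counts)])
    return np.array(preds)
```

CM-kNN replaces this fixed, global k with a per-test-point neighborhood derived from learned reconstruction weights.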
The least-squares error is used as the loss function to minimize the error of reconstructing each test data point from all training data points. An l1-norm regularization term is added to the loss function to enforce sparsity in the reconstruction. In addition, an l2,1-norm regularization term is added to remove the impact of noisy training data points that are irrelevant to all test data points in the reconstruction; this also promotes sparsity. Moreover, a graph Laplacian-based LPP regularization term is added to preserve the local structure of the training data points during reconstruction. Although the utility of the l1-norm, l2,1-norm, and LPP regularization terms has been demonstrated in previous work, their use in the context of this paper is novel. Finally, the authors optimize the loss function with an iterative reweighted least-squares (IRLS) method, present a corresponding algorithm, and prove its correctness.
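The optimization described above can be sketched roughly as follows. This is not the paper's exact formulation: the graph construction (a plain RBF similarity here), the placement of the Laplacian term, and the reweighting scheme are assumptions chosen to illustrate the IRLS idea, with the l1 and l2,1 terms handled by the standard element-wise and row-wise reweighting approximations.

```python
import numpy as np

def cm_knn_weights(X, Y, rho1=0.1, rho2=0.1, rho3=0.1, n_iter=30, eps=1e-6):
    """Illustrative IRLS sketch for CM-kNN-style reconstruction weights.

    X: (d, n) training points as columns; Y: (d, m) test points as columns.
    Returns W (n, m); column j holds the weights reconstructing test point j
    from all training points. Near-zero rows/entries of W yield a per-test-
    point number of effective neighbors.
    """
    d, n = X.shape
    m = Y.shape[1]
    # Similarity graph over training points and its Laplacian (hypothetical
    # RBF similarity; the paper's exact graph construction may differ).
    sq = np.sum(X**2, axis=0)
    S = np.exp(-(sq[:, None] + sq[None, :] - 2 * X.T @ X))
    L = np.diag(S.sum(axis=1)) - S
    G = X.T @ X                                        # (n, n) Gram matrix
    W = np.linalg.solve(G + eps * np.eye(n), X.T @ Y)  # warm start
    for _ in range(n_iter):
        # Row weights approximating the l2,1 term: 1 / (2 ||w_i||_2).
        row_w = 1.0 / (2.0 * np.linalg.norm(W, axis=1) + eps)
        for j in range(m):
            # Element weights approximating the l1 term: 1 / (2 |w_ij|).
            d1 = 1.0 / (2.0 * np.abs(W[:, j]) + eps)
            A = G + rho1 * np.diag(d1) + rho2 * np.diag(row_w) + rho3 * L
            W[:, j] = np.linalg.solve(A, X.T @ Y[:, j])
    return W
```

Each inner solve is a weighted ridge-style system; the element and row weights are refreshed from the current iterate, which is the essence of iterative reweighting.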
Extensive experiments have been conducted on ten datasets of different types across the three data mining applications mentioned above. The experiments are sufficient and the results are persuasive. In particular, missing values are introduced at random for the missing data imputation experiment. Prediction accuracy is used as the performance measure for classification, while the correlation coefficient and root-mean-square error are used as the performance measures for regression and missing data imputation.
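For completeness, the three performance measures named above are standard and can be computed as follows (a minimal sketch; the paper presumably uses equivalent definitions).

```python
import numpy as np

def accuracy(y_true, y_pred):
    """Classification: fraction of correctly predicted labels."""
    return np.mean(y_true == y_pred)

def rmse(y_true, y_pred):
    """Regression / imputation: root-mean-square error."""
    return np.sqrt(np.mean((y_true - y_pred) ** 2))

def corr_coef(y_true, y_pred):
    """Regression / imputation: Pearson correlation coefficient."""
    return np.corrcoef(y_true, y_pred)[0, 1]
```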