Classification in artificial intelligence aims to separate data into categories using the available variables. In medical datasets, high dimensionality and inherent variability make this task challenging.
In this paper, a hierarchical classifier is proposed for a multi-class classification problem: distinguishing colorectal cancer (CRC) from other, non-malignant conditions. First, a feature selection process ranks the variables in the training set using the support vector machine recursive feature elimination (SVM-RFE) algorithm; 70 of the 12,341 features were selected as the most relevant. Then, a binary SVM model with ten-fold cross-validation is built to separate the cancer and non-cancer classes. The samples in the non-cancer class are then passed to a one-class SVM that identifies patients bearing colorectal adenomas or other findings. Two datasets were used in this research, undiluted and diluted excitation-emission matrix (EEM) human plasma samples, both from Lawaetz et al.’s study [1]; 70 percent of each dataset was used for training and the rest for testing. On the undiluted data, the method shows a significant performance improvement; on the diluted data, the improvement is smaller.
The contributions of this paper are at least twofold: 1) a hierarchical classifier with two SVM models that separates CRC from other non-malignant conditions and improves classification performance; and 2) a ranking-based feature selection process that substantially reduces the number of variables, from 12,341 to 70. In addition, the time cost is a one-time cost of eight hours, incurred before model building.
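
The three stages described above (SVM-RFE ranking, a binary SVM with ten-fold cross-validation, and a one-class SVM on the non-cancer samples) can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes scikit-learn, uses small synthetic data in place of the EEM samples, and makes several choices the abstract does not specify, namely the label coding (0 = CRC, 1 = adenoma, 2 = other findings), ranking features against the cancer/non-cancer target, linear and RBF kernels, and training the one-class SVM on adenoma samples so that outliers are read as other findings.

```python
import numpy as np
from sklearn.feature_selection import RFE
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.svm import SVC, OneClassSVM

rng = np.random.default_rng(0)

# Synthetic stand-in for the EEM data (assumption: sizes chosen for speed only).
# Hypothetical label coding: 0 = CRC, 1 = adenoma, 2 = other findings.
n_per_class, n_features, n_keep = 40, 200, 10
y = np.repeat([0, 1, 2], n_per_class)
X = rng.normal(size=(y.size, n_features))
X[y == 0, :5] += 2.0     # a few features made informative for cancer
X[y == 1, 5:10] += 2.0   # a few features made informative for adenoma

# 70 percent of the samples for training, the rest for testing.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0
)
is_cancer_train = (y_train == 0).astype(int)

# Stage 1: SVM-RFE ranks features by linear-SVM weights and drops the weakest.
selector = RFE(SVC(kernel="linear"), n_features_to_select=n_keep, step=0.1)
selector.fit(X_train, is_cancer_train)
Xtr, Xte = selector.transform(X_train), selector.transform(X_test)

# Stage 2: binary SVM (cancer vs. non-cancer), assessed with ten-fold CV.
binary_svm = SVC(kernel="linear")
cv_scores = cross_val_score(binary_svm, Xtr, is_cancer_train, cv=10)
binary_svm.fit(Xtr, is_cancer_train)

# Stage 3: one-class SVM fitted on adenoma samples only (an assumption).
one_class = OneClassSVM(kernel="rbf", gamma="scale", nu=0.1)
one_class.fit(Xtr[y_train == 1])


def predict_hierarchical(X_sel):
    """Route samples through the binary SVM, then the one-class SVM."""
    labels = np.full(len(X_sel), 2)              # default: other findings
    cancer = binary_svm.predict(X_sel) == 1
    labels[cancer] = 0
    rest = np.flatnonzero(~cancer)               # indices of non-cancer samples
    if rest.size:
        inlier = one_class.predict(X_sel[rest]) == 1
        labels[rest[inlier]] = 1                 # inliers read as adenoma
    return labels


y_pred = predict_hierarchical(Xte)
print("mean ten-fold CV accuracy (binary stage):", cv_scores.mean())
print("hierarchical test accuracy:", (y_pred == y_test).mean())
```

The hierarchy means the one-class model only ever sees samples that the binary stage has already labeled non-cancer, so errors at the first stage propagate to the second; the printed scores on synthetic data say nothing about performance on the real plasma samples.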