An important objective in clinical oncology is to rapidly determine the type of tumor a patient has and to predict how that tumor will behave, so that the most appropriate treatment can be identified. Traditionally, this has been done by pathologists, who examine the cell types and other morphological features present in the tumor and surrounding tissues. The availability of microarray technology for measuring RNA expression levels for every gene in the genome has opened the door to characterizing the biology of tumors in a completely different way. Previous studies have demonstrated that genome-wide expression analysis can yield important information about tumor biology that complements the traditional pathological features.
The availability of genome-wide expression data has created a tremendous demand for data mining and machine learning methods that can produce a subset of important features (namely, genes), along with a mathematical model relating variability in those features to variability in tumor class. Xu et al. present a hybrid machine learning approach for identifying gene expression features that are associated with a polytomous tumor endpoint (one with more than two classes). The authors use an adaptive resonance theory (ART) neural network for classification, with a particle swarm optimizer (PSO) for feature selection. They apply this approach to three real datasets and compare its performance with that of other methods that have been applied to the same data. They show that this approach is competitive with other approaches, such as a probabilistic neural network. The combination of the ART-based neural network with PSO is novel.
An important consideration when evaluating and comparing data mining and machine learning methods is whether the methods that seem to perform best are actually finding the true signal in the noise. This is very difficult to assess when the benchmarking is performed on real datasets, since the truth is not known. The alternative is to compare methods using simulated data, where the signal is engineered into a noisy dataset. The challenge with simulated or artificial data is that engineering a realistic pattern into the data may be difficult, and many assumptions are typically made that might not be valid. Nevertheless, simulated data are a good starting point for any new method, because they provide an important performance baseline prior to the analysis of real data. Once a new method is applied to real data and compared to other methods, the gold standard should be the biological interpretation rather than, for example, the classification accuracy. A method that produces a good classifier with a biologically meaningful model will be more valuable to a clinical oncologist than a good classifier that cannot be interpreted. These are all important things to keep in mind when developing and evaluating new classification methods for high-dimensional biological datasets.
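
The idea of engineering a known signal into noisy data can be made concrete with a minimal sketch. It is an illustration under stated assumptions, not the procedure used by Xu et al.: the noise is independent Gaussian, the planted signal is a simple mean shift in one class per gene, and a plain univariate one-way ANOVA F-statistic stands in for the feature selector (it is not the ART/PSO method). The function names and all parameter values are hypothetical.

```python
import numpy as np


def simulate_expression(n_samples=90, n_genes=500, n_informative=5,
                        n_classes=3, effect=3.0, seed=0):
    """Simulate a noisy expression matrix with a planted class signal.

    Assumption: noise is i.i.d. standard normal, which real microarray
    data will not satisfy (genes are correlated, noise is not Gaussian).
    """
    rng = np.random.default_rng(seed)
    y = np.arange(n_samples) % n_classes            # balanced class labels
    X = rng.normal(size=(n_samples, n_genes))       # pure noise background
    informative = np.sort(rng.choice(n_genes, size=n_informative,
                                     replace=False))
    # Planted signal: informative gene j is shifted upward in exactly
    # one class (class j mod n_classes) and left unchanged in the others.
    for j, gene in enumerate(informative):
        X[y == (j % n_classes), gene] += effect
    return X, y, informative


def anova_f(X, y):
    """One-way ANOVA F-statistic for every gene (column) of X."""
    classes = np.unique(y)
    grand_mean = X.mean(axis=0)
    ss_between = np.zeros(X.shape[1])
    ss_within = np.zeros(X.shape[1])
    for c in classes:
        Xc = X[y == c]
        class_mean = Xc.mean(axis=0)
        ss_between += len(Xc) * (class_mean - grand_mean) ** 2
        ss_within += ((Xc - class_mean) ** 2).sum(axis=0)
    df_between = len(classes) - 1
    df_within = len(y) - len(classes)
    return (ss_between / df_between) / (ss_within / df_within)


def recall_of_planted_genes(scores, informative):
    """Fraction of planted genes found among the top-scoring genes."""
    k = len(informative)
    top_k = np.argsort(scores)[-k:]
    return len(set(top_k.tolist()) & set(informative.tolist())) / k


X, y, truth = simulate_expression()
scores = anova_f(X, y)
print("planted genes: ", truth)
print("top-ranked genes:", np.sort(np.argsort(scores)[-len(truth):]))
print("recall:", recall_of_planted_genes(scores, truth))
```

Because the planted genes are known, recall can be measured directly, which is not possible on a real dataset. Lowering `effect` or adding correlated noise would make the benchmark harder and arguably more realistic, at the cost of introducing further assumptions.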