Models built on large datasets of in vitro tests on cancer cell lines can be used to predict how patients will respond to drugs in cancer therapy. Such data is complex because it involves several kinds of entities: genes, cell lines, and anticancer drugs.
A difficulty in building those models is that gene expression data is high dimensional, with the number of genes far exceeding the number of cell lines. To address this challenge, the author developed kMTrace, a multitask learning method for gene expression data that uses a nonlinear kernel (radial basis function) to capture nonlinear relationships. “Multitask” means that all the available responses are considered together. Three models were built on three public datasets containing the genomic profiles of cancer cells and their sensitivity to drugs. After manual selection, kMTrace used about 1000 genes, a few hundred cell lines, and about 100 drugs. The results show an improvement in mean squared error (MSE) over other algorithms from the literature, and an extensive discussion analyzes the results and explains them in terms of available knowledge.
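For readers unfamiliar with the ingredients named above, the following is a minimal sketch of how a radial basis function kernel can be used to predict several drug responses at once from gene expression and how the result is scored with MSE. It is not kMTrace: it swaps in plain kernel ridge regression, which fits all response columns in a single solve but does not reproduce whatever coupling between tasks kMTrace uses. The data is synthetic, and the sizes, the function names, and the values of `gamma` and `alpha` are assumptions chosen only for illustration.

```python
import numpy as np


def rbf_kernel(A, B, gamma):
    """Gaussian (RBF) kernel matrix between the rows of A and the rows of B."""
    sq_dists = (
        np.sum(A**2, axis=1)[:, None]
        + np.sum(B**2, axis=1)[None, :]
        - 2.0 * A @ B.T
    )
    # Clamp tiny negative values caused by floating-point round-off.
    return np.exp(-gamma * np.maximum(sq_dists, 0.0))


def fit_kernel_ridge(X, Y, gamma, alpha):
    """Solve (K + alpha * I) C = Y; C has one column per drug (task)."""
    K = rbf_kernel(X, X, gamma)
    return np.linalg.solve(K + alpha * np.eye(len(X)), Y)


def predict(X_train, C, X_new, gamma):
    """Predict every drug response for the new cell lines."""
    return rbf_kernel(X_new, X_train, gamma) @ C


# Synthetic stand-in data (NOT the datasets used in the paper):
# rows are cell lines, columns of X are genes, columns of Y are drugs.
rng = np.random.default_rng(0)
n_cells, n_genes, n_drugs = 120, 300, 10  # more genes than cell lines
X = rng.standard_normal((n_cells, n_genes))
W = rng.standard_normal((n_genes, n_drugs)) / np.sqrt(n_genes)
Y = np.tanh(X @ W) + 0.1 * rng.standard_normal((n_cells, n_drugs))

train, test = np.arange(90), np.arange(90, 120)
gamma = 1.0 / n_genes  # assumed kernel width, not taken from the paper
C = fit_kernel_ridge(X[train], Y[train], gamma, alpha=1.0)
Y_hat = predict(X[train], C, X[test], gamma)

# MSE averaged over all held-out (cell line, drug) pairs.
mse = float(np.mean((Y_hat - Y[test]) ** 2))
print(f"held-out MSE over {n_drugs} drugs: {mse:.3f}")
```

Because the kernel matrix has one row and one column per cell line, the linear system stays small even when genes far outnumber cell lines, which is one reason kernel methods suit this setting.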
Even though one model predicts the response to drugs (expressed as IC50, which is a logarithmic measure) with a poor MSE of about three, overall the models help advance knowledge, for instance by identifying associations between drugs and tumors that have not yet been studied. The source code is available, which is a good service to people in the field.
On the other hand, this is a very technical paper, written for a specialized audience; its lessons are difficult to apply outside the problem area addressed.