The amino acid sequence alone cannot be expected to predict how a protein folds, because many proteins with similar folds have quite different sequences. To address this issue, this paper describes how sparse representation classification (SRC) can be used for protein fold recognition.
The authors propose SRC as a way to group and predict folds. In this system, the training set is divided into blocks corresponding to the different classes; the samples are viewed as points in a Euclidean space, with each class forming a subset. The feature vector of a query protein is then expressed as a linear combination of the elements in each block of the training set. Ideally, most of the coefficients will be zero except for those corresponding to the correct class. Because the equations for the coefficients form an underdetermined system, a solution with most coefficients equal to zero is sought by formulating a minimization problem. Minimizing the number of nonzero coefficients directly is a combinatorial problem; using the ℓ1 norm instead makes it tractable.
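The scheme described above can be illustrated with a minimal sketch. It rests on several assumptions that do not come from the paper: the ℓ1 problem is solved in its penalized (lasso-type) form with a simple iterative soft-thresholding loop (ISTA), the query is assigned to the class whose block of coefficients gives the smallest reconstruction residual (the usual SRC decision rule), and the names `ista_l1`, `src_predict`, and `lam`, as well as the toy data, are hypothetical. The authors' actual solver and features may differ.

```python
import numpy as np


def soft_threshold(v, t):
    """Shrink every entry of v toward zero by t (proximal map of the l1 norm)."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)


def ista_l1(A, y, lam=0.01, n_iter=2000):
    """Approximately minimize 0.5 * ||A x - y||^2 + lam * ||x||_1 with ISTA."""
    step = 1.0 / np.linalg.norm(A, 2) ** 2  # 1 / (largest singular value)^2
    x = np.zeros(A.shape[1])
    for _ in range(n_iter):
        grad = A.T @ (A @ x - y)
        x = soft_threshold(x - step * grad, step * lam)
    return x


def src_predict(A, labels, y, lam=0.01):
    """Return the class with the smallest class-wise residual, and the coefficients.

    A      : (n_features, n_train) matrix, one training sample per column
    labels : (n_train,) class label of each column
    y      : (n_features,) feature vector of the query protein
    """
    A = A / np.linalg.norm(A, axis=0, keepdims=True)  # unit-norm columns
    x = ista_l1(A, y, lam)
    classes = np.unique(labels)
    residuals = [
        np.linalg.norm(y - A[:, labels == c] @ x[labels == c]) for c in classes
    ]
    return classes[int(np.argmin(residuals))], x


# Toy data (made up): 4 features, 6 training samples, 2 classes.
A = np.array([
    [1.0, 0.9, 0.8, 0.0, 0.0, 0.0],
    [0.0, 0.1, 0.2, 0.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, 1.0, 0.9, 0.8],
    [0.0, 0.0, 0.0, 0.0, 0.1, 0.2],
])
labels = np.array([0, 0, 0, 1, 1, 1])
y = np.array([0.0, 0.0, 0.85, 0.15])  # lies in the span of the class-1 block

predicted, coeffs = src_predict(A, labels, y)
print(predicted)
```

In this toy case the coefficients attached to the class-0 columns stay at zero, so only the class-1 block contributes to the reconstruction. With more columns than rows the linear system is underdetermined, and the ℓ1 penalty is what selects a sparse solution among the many that fit.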
The authors evaluate the effect of using SRC in place of the support vector machine (SVM) framework in three state-of-the-art methods: D-D, Bi-gram, and ACC fold. The features used in each SRC version are the same as those used in the corresponding original method. The result in each case is a one to four percent improvement in overall accuracy. The authors further describe a predictive method that uses all three of the methods in a combined SRC framework.