Computing Reviews
Today's Issue Hot Topics Search Browse Recommended My Account Log In
Review Help
Search
Learning local languages and their application to DNA sequence analysis
Yokomori T., Kobayashi S. IEEE Transactions on Pattern Analysis and Machine Intelligence20 (10):1067-1079,1998.Type:Article
Date Reviewed: May 1 1999

This research applies formal languages and the theory of automata to theoretical biology. The specific problem dealt with is to find deterministic finite state automata (DFA) identifying the &agr;-chain region in the amino acid sequence for hemoglobin.

Machine learning has been successfully applied for the classification, identification, and prediction of biological properties of proteins and DNA sequences. Generally, a homology technique is used to construct a template. In contrast, this paper deals with automatic means of deriving the template from the sample data.

In the early 1960s, it was believed that a class of languages with all finite sets is not learnable if it contains even one infinite set. Recently, it has been shown that learning is possible in the limit from positive data. The regular languages are classified into reversible, splicing, and strictly locally testable languages. Locally testable languages are related to a subclass of first-order logic that is useful to describe genetic information. Thus, this research has an impact on bio-informatics.

The paper consists of five sections. The first introduces regular languages, k-testable languages, and the real-life problem in DNA sequencing. Section 2 introduces the concepts of strictly deterministic finite state automata and learning in the limit and an algorithm for learning a language. In section 3, a linear-time learning algorithm for k-testable languages, in the limit for positive data alone, is presented in the form of lemmas with proofs. A language L is strictly locally testable if there exists an integer k ≥ 1 such that L is strictly k-testable [1]. The application of the algorithms was tested to identify protein &agr;-regions in amino acid sequences. The success of the new algorithm is around 95 percent for identification of positive data and 96 percent for negative data. The last section discusses related literature and presents concluding remarks.

Head’s theoretical work [2] suggests that amino acid sequences can be expressed by a strictly locally testable language. Hemoglobin consists of more than one polypeptide chain. The heme group is associated with two &agr; and two &bgr; subunits. The &agr;-chain of the hemoglobin is a polypeptide comprising 141 amino acids. The authors used two kinds of letter-to-letter translation. The first is from a 20- to a 6-letter alphabet, due to Dayhof’s coding method, and the other is to a binary alphabet. The raw positive data consists of finite sets of strings of 20 symbols, each corresponding to an amino acid.

The authors arrived at a specific type of DFA Mα called an &agr;-chain DFA for hemoglobin, which can be used to identify the &agr;-chain region for unknown amino acid sequences. It is also possible to predict the &agr;-helix region in amino acid sequences in other proteins using the present algorithm.

Reviewer:  R. Sambasiva Rao Review #: CR122249 (9905-0367)
1) McNaughton, R. and Papert, S. Counter-free automata. MIT Press, Cambridge, MA, 1971.
2) Head, T. Formal language theory and DNA: an analysis of the generative capacity of specific recombinant behaviours. Bull. Math. Biol. 49 (1987), 737–759.
Bookmark and Share
 
Formal Languages (F.4.3 )
 
 
General (I.6.0 )
 
Would you recommend this review?
yes
no
Other reviews under "Formal Languages": Date
The three subfamilies of rational &ohgr;-languages closed under &ohgr;-transduction
Timmerman E. Theoretical Computer Science 76(2-3): 243-250, 2001. Type: Article
Jul 1 1992
Introduction to formal languages
Révész G., Dover Publications, Inc., New York, NY, 1991. Type: Book (9780486666976)
Jan 1 1993
Fibonacci morphisms and Sturmian words
Séébold P. Theoretical Computer Science 88(2): 365-384, 1991. Type: Article
Oct 1 1992
more...

E-Mail This Printer-Friendly
Send Your Comments
Contact Us
Reproduction in whole or in part without permission is prohibited.   Copyright 1999-2024 ThinkLoud®
Terms of Use
| Privacy Policy