In this significant thesis, the author studies two fundamental problems of genomic sequence analysis: gene finding and homology search. The work is organized into four chapters.
The author introduces new techniques for finding genes in genomic sequences using sequence similarity information, together with a flexible framework for combining multiple sources of evidence in gene finding. This framework generalizes existing methods for combining complete probability statements to the case of partial statements.
The thesis also addresses the problem of finding similar sequences in large sequence databases, improving a popular algorithm by using probabilistic models of the target sequence similarities. This yields more useful evidence for gene finding. The author constructs spaced seeds that significantly increase the number of homologous coding regions found, and that improve the accuracy and running time of similarity search in protein coding regions.
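To illustrate the general idea of a spaced seed, the following is a minimal Python sketch, not the thesis's implementation. A seed is written as a string of 1s (positions that must match) and 0s (positions that are ignored), and a hit is reported wherever two sequences agree at every 1 position. The function name `seed_hits` and the pattern `110110` are hypothetical choices made for this example only; the pattern skips every third position, on the assumption that the third codon position varies most in coding regions.

```python
def seed_hits(query, target, seed):
    """Return (i, j) pairs where query[i:] and target[j:] agree
    at every '1' position of the seed pattern."""
    care = [k for k, c in enumerate(seed) if c == "1"]  # positions that must match
    span = len(seed)

    # Index the target by the letters found at the "care" positions.
    index = {}
    for j in range(len(target) - span + 1):
        key = "".join(target[j + k] for k in care)
        index.setdefault(key, []).append(j)

    # Look up every query window in the index.
    hits = []
    for i in range(len(query) - span + 1):
        key = "".join(query[i + k] for k in care)
        for j in index.get(key, []):
            hits.append((i, j))
    return hits


# Toy sequences that differ at the third and sixth positions.
spaced = seed_hits("ATGGCC", "ATAGCT", "110110")
contiguous = seed_hits("ATGGCC", "ATAGCT", "111111")
print(spaced, contiguous)
```

In this toy case the spaced pattern reports a hit that the contiguous pattern of the same length misses, because the mismatches fall on ignored positions.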
ExonHunter, the gene finder developed in the thesis, combines evidence from proteins, expressed sequence tags (ESTs), and genome alignments. The gene finder is tested on human and fruit fly genomic sequences, and it outperforms other gene finders that use only genomic alignments as a source of information.
The thesis also introduces hidden Markov models, which form the basis of the gene finder and are used to model the properties of sequence similarities.
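As a rough illustration of how a hidden Markov model can label a sequence, the sketch below runs the standard Viterbi algorithm on a two-state toy model. It is not the model used in ExonHunter: the state names `N` (non-coding) and `C` (coding) and all probabilities are invented for this example, and the function assumes a non-empty observation sequence with no zero probabilities.

```python
import math


def viterbi(obs, states, start, trans, emit):
    """Return the most probable state path for obs (log-space Viterbi)."""
    # best[s]: log-probability of the best path ending in state s so far
    best = {s: math.log(start[s]) + math.log(emit[s][obs[0]]) for s in states}
    back = []  # one back-pointer table per step after the first
    for x in obs[1:]:
        row, ptr = {}, {}
        for s in states:
            prev = max(states, key=lambda p: best[p] + math.log(trans[p][s]))
            row[s] = best[prev] + math.log(trans[prev][s]) + math.log(emit[s][x])
            ptr[s] = prev
        best = row
        back.append(ptr)

    # Trace back from the best final state.
    path = [max(states, key=lambda s: best[s])]
    for ptr in reversed(back):
        path.append(ptr[path[-1]])
    return path[::-1]


# Toy parameters (assumptions for illustration only).
STATES = ["N", "C"]
START = {"N": 0.5, "C": 0.5}
TRANS = {"N": {"N": 0.9, "C": 0.1}, "C": {"N": 0.1, "C": 0.9}}
EMIT = {
    "N": {"A": 0.4, "T": 0.4, "C": 0.1, "G": 0.1},  # AT-rich background
    "C": {"A": 0.1, "T": 0.1, "C": 0.4, "G": 0.4},  # GC-rich "coding" state
}

print(viterbi("ATATGCGCGC", STATES, START, TRANS, EMIT))
```

A real gene finder uses many more states (for example exons, introns, and splice sites), but the decoding step follows the same dynamic-programming pattern.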
This study will be of interest to those researching transmembrane protein topology, protein secondary structure prediction, and components of homology.