Modern high-throughput biology requires computationally accessible descriptions of the roles of proteins. The Gene Ontology (GO), a controlled vocabulary for annotating protein function, biological process, and cellular location, has become the community standard for protein annotation. Chiang and Yu describe a sentence pattern mining approach that automatically extracts GO terms for proteins from the biomedical literature.
GO annotation has been one of the recent tasks in the Text Retrieval Conference (TREC) information retrieval competition. Automated methods for extracting GO term-protein relationships must first identify the natural language expressions that correspond to protein names and to GO terms. Extraction of protein names is relatively well studied, and Chiang and Yu found it to be the easier of the two tasks. Extraction of GO terms relies on recognizing variants of each term by means of morphological, syntactic, and semantic rules. Once GO terms and protein names are identified, sentences in which a GO term and a protein name co-occur are parsed to obtain phrases describing protein function. The resulting phrase structure serves as input for sentence pattern mining. Chiang and Yu’s results with the TREC 2003 data are comparable to the benchmark results, although their figures make the comparison somewhat difficult.
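The co-occurrence step described above can be illustrated with a minimal sketch. This is not Chiang and Yu's implementation: the lexicons `GO_VARIANTS` and `PROTEIN_NAMES`, the hand-listed term variants, the naive sentence splitter, and all function names are assumptions made for illustration. A real system would derive variants from morphological, syntactic, and semantic rules and would use a proper protein name recognizer.

```python
import re

# Hypothetical toy lexicons (assumptions for illustration only).
# Each GO identifier maps to a few hand-listed surface variants.
GO_VARIANTS = {
    "GO:0006915": ["apoptotic process", "apoptosis"],
    "GO:0005634": ["nucleus", "nuclear"],
}
PROTEIN_NAMES = ["p53", "BRCA1"]


def split_sentences(text):
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def find_mentions(sentence, phrases):
    """Return the phrases found in the sentence as whole words, ignoring case."""
    hits = []
    for phrase in phrases:
        pattern = r"\b" + re.escape(phrase) + r"\b"
        if re.search(pattern, sentence, flags=re.IGNORECASE):
            hits.append(phrase)
    return hits


def cooccurring_sentences(text):
    """Return (sentence, proteins, go_ids) for sentences that mention
    at least one protein name and at least one GO term variant."""
    results = []
    for sentence in split_sentences(text):
        proteins = find_mentions(sentence, PROTEIN_NAMES)
        go_ids = [
            go_id
            for go_id, variants in GO_VARIANTS.items()
            if find_mentions(sentence, variants)
        ]
        if proteins and go_ids:
            results.append((sentence, proteins, go_ids))
    return results


if __name__ == "__main__":
    abstract = (
        "p53 induces apoptosis in damaged cells. "
        "BRCA1 is studied widely. "
        "The nucleus is large."
    )
    for sentence, proteins, go_ids in cooccurring_sentences(abstract):
        print(proteins, go_ids, sentence)
```

Only the sentences returned by `cooccurring_sentences` would be passed on to a parser. The later stages, phrase extraction and pattern mining, are outside the scope of this sketch.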
This paper provides a nice overview of the critical issues that must be addressed in extracting GO-protein relationships from the literature. It describes a promising approach for automated extraction, illustrates how difficult it is to extract GO terms at the same depth in the hierarchy as manual annotation achieves, and points out promising avenues for future research.