The University of Edinburgh
Division of Informatics
Forrest Hill & 80 South Bridge

MSc Thesis #93122

Title: Basic Gene Grammars and DNA-Chart Parser - a Simple DNA Parsing System.
Date: 1993
Abstract: Deoxyribonucleic acid (DNA) encodes genetic information in a kind of language. The field of "DNA linguistics", or "DNA language processing", has emerged recently from pioneering work by AI researchers and molecular biologists. Most previous work uses Definite Clause Grammars to represent DNA sequences. The present study provides a simple DNA chart parsing system comprising a grammar formalism, Basic Gene Grammars, and a chart parser, DNA-ChartParser. The use of Basic Gene Grammars is demonstrated by representing Escherichia coli promoter sequences, one of the most studied types of DNA sequence. Many formulations of knowledge about E. coli promoters can be represented in Basic Gene Grammars, including knowledge acquired from human experts, consensus sequences, statistics (weight matrices), symbolic learning, and neural network learning. In keeping with Basic Gene Grammars, the DNA-ChartParser provides basic bidirectional parsing facilities. The parser can also handle overlapping categories, gap categories, approximate pattern matching, and constraints. Using Basic Gene Grammars and the DNA-ChartParser, current knowledge for recognizing E. coli promoters was assessed by parsing the DNA sequences in two real-world data sets which comprise most of the known E. coli promoters. The parsing results indicate that consensus sequences and human-devised domain theory are not useful in recognizing E. coli promoters. The statistical (weight matrix) approach performs well on both the small data set (82% accuracy) and the large data set (72% accuracy). Machine learning approaches perform almost perfectly (approximately 100% accuracy) on the training data sets but not on unseen examples (only 55-60% accuracy).
