Publication Date

2010

Document Type

Thesis

Committee Members

Travis Doom (Advisor), Sridhar Ramachandran (Committee Member), Michael Raymer (Committee Member)

Degree Name

Master of Science in Computer Engineering (MSCE)

Abstract

As the number of completely sequenced genomes continues to grow rapidly, the identification of gene regions in DNA sequence data remains one of the most important open problems in bioinformatics. Improving the accuracy of gene-finding tools by even a small percentage would yield correct predictions for many additional genes in an organism (Zhu et al., 2010). This thesis presents a novel approach for identifying the coding regions of a genome based on dipeptide usage.

Patterns in dipeptide usage are used to discriminate between coding and non-coding DNA regions. Two-sample t-tests are used as tests of significance to determine which dipeptides differ significantly in their occurrence between coding and non-coding regions. These methods are primarily tested on the Escherichia coli 536 genome, where they reached an accuracy of 96.5% in identifying coding regions and 100% in identifying non-coding regions. The classifier trained on the Escherichia coli 536 genome is then used to predict the coding and non-coding regions of the Salmonella enterica subsp. enterica serovar Typhi genome, where it achieved an accuracy of 79.5% in predicting coding regions and 100% in predicting non-coding regions.
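The abstract does not give implementation details, so the following is only a minimal sketch of how dipeptide usage could be compared between coding and non-coding sequences with a two-sample t-test. It is not the thesis's actual method. The function names, the single-frame translation, the Welch (unequal-variance) form of the t statistic, and the cutoff of 2.0 are all assumptions made for illustration.

```python
from collections import Counter
from itertools import product
from math import copysign, inf, sqrt
from statistics import mean, variance

# Standard genetic code, codons enumerated in TCAG order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def translate(dna):
    """Translate DNA in reading frame 0; unknown codons become 'X'."""
    dna = dna.upper()
    return "".join(
        CODON_TABLE.get(dna[i:i + 3], "X") for i in range(0, len(dna) - 2, 3)
    )


def dipeptide_freqs(protein):
    """Relative frequency of each overlapping amino-acid pair."""
    pairs = [protein[i:i + 2] for i in range(len(protein) - 1)]
    total = len(pairs)
    return {dp: n / total for dp, n in Counter(pairs).items()} if total else {}


def welch_t(a, b):
    """Two-sample t statistic (Welch form, an assumption of this sketch)."""
    diff = mean(a) - mean(b)
    se = sqrt(variance(a) / len(a) + variance(b) / len(b))
    if se == 0:
        # No within-group spread: any difference in means is infinitely large.
        return 0.0 if diff == 0 else copysign(inf, diff)
    return diff / se


def significant_dipeptides(coding_seqs, noncoding_seqs, threshold=2.0):
    """Dipeptides whose |t| exceeds a hypothetical cutoff.

    Each group needs at least two sequences so a variance can be computed.
    """
    coding = [dipeptide_freqs(translate(s)) for s in coding_seqs]
    noncoding = [dipeptide_freqs(translate(s)) for s in noncoding_seqs]
    result = {}
    for dp in sorted(set().union(*coding, *noncoding)):
        a = [f.get(dp, 0.0) for f in coding]
        b = [f.get(dp, 0.0) for f in noncoding]
        t = welch_t(a, b)
        if abs(t) > threshold:
            result[dp] = t
    return result
```

In this sketch a positive t value marks a dipeptide that is more frequent in the coding group and a negative value one that is more frequent in the non-coding group. A real analysis would convert the statistic to a p-value and correct for testing all 400 dipeptides.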

Page Count

119

Department or Program

Department of Computer Science and Engineering

Year Degree Awarded

2010
