Document Type

Conference Proceeding

Publication Date



Now that a draft sequence of the human genome is nearly complete, questions about both the information contained within our genetic blueprint and the manner in which that information changes over time can be addressed in ways that were not previously possible. By their very nature, some of the nucleotide sequences present within our genome allow detailed examination of the mode and pattern of evolution that has shaped our genetic instructions over tens of millions of years. Alu repeats are one example. Using these relatively short, ubiquitous DNA sequences, we explore the problem of predicting the relative abundance of the various possible substitution events that have accumulated over the past 20 million years. To perform well on biological sequence data, computational methods must tolerate both natural variation in the data and noise introduced during measurement. For this reason, and because they can search complex, noisy search spaces, evolutionary computation techniques are particularly promising for the analysis of nucleotide sequence data and other biological data sets. We have used these techniques to address a key question in understanding the process of evolution: the effect of genomic context on substitutions, that is, the degree to which the genomic information surrounding a particular region of a chromosome affects the changes to that region over time. Specifically, we used genetic programming to predict changes in these DNA sequences over time. The results reveal that a significant proportion of DNA nucleotide substitutions within a given region are governed by a model that takes into consideration only the GC-content of the DNA sequences surrounding that region.
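
As an illustration of the quantity the abstract refers to, the GC-content of the sequence surrounding a region can be computed as in the minimal Python sketch below. This is not the authors' implementation: the function names gc_content and flanking_gc, the flank parameter, and the default window of 10 bases per side are all hypothetical choices made for the example, and the abstract does not state how the surrounding context was delimited.

```python
def gc_content(seq: str) -> float:
    """Return the fraction of unambiguous bases (A, C, G, T) that are G or C.

    Ambiguous symbols such as N are ignored; an empty or fully ambiguous
    sequence yields 0.0. (Illustrative helper, not from the paper.)
    """
    seq = seq.upper()
    gc = seq.count("G") + seq.count("C")
    at = seq.count("A") + seq.count("T")
    total = gc + at
    return gc / total if total else 0.0


def flanking_gc(genome: str, start: int, end: int, flank: int = 10) -> float:
    """GC-content of up to `flank` bases on each side of genome[start:end].

    The region itself is excluded, so the value depends only on the
    surrounding context. The window size is an assumed parameter.
    """
    left = genome[max(0, start - flank):start]
    right = genome[end:end + flank]
    return gc_content(left + right)
```

For example, flanking_gc("GGGGATATCCCC", 4, 8, flank=4) considers only the flanks GGGG and CCCC and ignores the ATAT region between them. A context-only model of the kind described in the abstract would take a value like this as its sole input when predicting substitutions in the enclosed region.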


Presented at the Fourteenth Midwest Artificial Intelligence and Cognitive Sciences Conference, Cincinnati, OH, April 12-13, 2003.