Publication Date


Document Type


Committee Members

Dale Courte (Committee Member), Travis Doom (Committee Member), Dan Krane (Committee Member), Michael Raymer (Advisor), Mateen Rizki (Committee Member)

Degree Name

Doctor of Philosophy (PhD)


Understanding the structure and function of proteins is a key part of understanding biological systems. Although proteins are complex biological macromolecules, they are made up of only 20 basic building blocks known as amino acids, so the makeup of a protein can be described as a sequence of amino acids. One of the most important tools in modern bioinformatics is the ability to search for biological sequences (such as protein sequences) that are similar to a given query sequence. There are many tools for doing this (Altschul et al., 1990; Hobohm and Sander, 1995; Thomson et al., 1994; Karplus and Barrett, 1998). Most of these tools, however, focus on closely related, or homologous, sequences. Distantly related protein sequences (remote homologs) are of interest to biologists but remain notoriously difficult to find. This dissertation presents a novel method for finding remote homologs in databases of protein sequences. In this method, proteins are characterized according to physicochemical and sequence-based features. Features are then weighted according to their utility in identifying distantly related protein sequences, and the feature weights are optimized by a custom genetic algorithm. Position-specific scoring matrices are used to further improve the ability of the tuned algorithm to generalize its search capability to new sequences. The resulting search method outperforms the best-known techniques for finding distant homologs in both accuracy and computation time.
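The idea of weighting features and tuning those weights with a genetic algorithm can be illustrated with a minimal sketch. This is not the dissertation's implementation. The feature used here (amino acid composition), the weighted Euclidean distance, the pairwise ranking fitness, and every function name are assumptions chosen for illustration only, and position-specific scoring matrices are not modeled.

```python
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard amino acids

def composition(seq):
    """Fraction of each amino acid in a non-empty sequence (an assumed, simple feature set)."""
    n = len(seq)
    return [seq.count(a) / n for a in AMINO_ACIDS]

def weighted_distance(f1, f2, weights):
    """Weighted Euclidean distance between two feature vectors."""
    return sum(w * (a - b) ** 2 for w, a, b in zip(weights, f1, f2)) ** 0.5

def fitness(weights, query, related, unrelated):
    """Fraction of (related, unrelated) pairs in which the related sequence is closer to the query."""
    q = composition(query)
    wins = 0
    for r in related:
        d_r = weighted_distance(q, composition(r), weights)
        for u in unrelated:
            if d_r < weighted_distance(q, composition(u), weights):
                wins += 1
    return wins / (len(related) * len(unrelated))

def evolve(query, related, unrelated, pop_size=20, generations=30, seed=0):
    """Toy genetic algorithm: keep the fitter half, refill by crossover plus mutation.

    Assumes pop_size >= 4 so that at least two parents survive each generation.
    """
    rng = random.Random(seed)
    score = lambda w: fitness(w, query, related, unrelated)
    pop = [[rng.random() for _ in AMINO_ACIDS] for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=score, reverse=True)
        parents = pop[: pop_size // 2]
        children = []
        while len(parents) + len(children) < pop_size:
            a, b = rng.sample(parents, 2)
            cut = rng.randrange(1, len(a))          # one-point crossover
            child = a[:cut] + b[cut:]
            i = rng.randrange(len(child))           # mutate one weight, clamped to [0, 1]
            child[i] = min(1.0, max(0.0, child[i] + rng.gauss(0, 0.1)))
            children.append(child)
        pop = parents + children
    return max(pop, key=score)
```

Calling evolve with a query, a list of known related sequences, and a list of unrelated sequences returns one weight per feature. A realistic setup would use many queries, a richer feature set, and held-out data to check that the tuned weights generalize.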

Page Count


Department or Program

Department of Computer Science and Engineering

Year Degree Awarded