Publication Date

2011

Document Type

Thesis

Committee Members

Keke Chen (Committee Member), Shaojun Wang (Advisor), Xinhui Zhang (Committee Member)

Degree Name

Master of Science (MS)

Abstract

A novel probabilistic discriminative model based on conditional random fields, CONTRAfold, has recently been proposed for single-sequence RNA secondary structure prediction. By incorporating most of the features that closely mirror the local interaction terms of thermodynamics-based models, the CONTRAfold model has outperformed both probabilistic and physics-based techniques and achieved the highest single-sequence prediction accuracies. CONTRAfold, like most other RNA secondary structure prediction techniques, requires a collection of RNA sequences with known secondary structure to serve as training data for the algorithm. Manual annotation of RNA sequences is both expensive and time-consuming, and far more sequences lack a known structure than have been structurally annotated. In this paper, we present a principled maximum entropy approach to train the same underlying model used in CONTRAfold using both structurally annotated RNA sequences and a large number of unlabeled RNA sequences. We propose a semi-supervised learning technique that uses an entropy decomposition method to efficiently compute the gradient of the conditional entropy on unlabeled RNA sequences. Our experimental results show that the proposed maximum entropy semi-supervised learning technique significantly increases the F-value, by up to 3.5%, when unlabeled RNA sequences are included in the training procedure.

Page Count

30

Department or Program

Department of Computer Science

Year Degree Awarded

2011

