Publication Date

2021

Document Type

Dissertation

Committee Members

Michael Raymer, Ph.D. (Advisor); Michael Markey, Ph.D. (Committee Member); Travis Doom, Ph.D. (Committee Member); Tanvi Banerjee, Ph.D. (Committee Member)

Degree Name

Doctor of Philosophy (PhD)

Abstract

Sample mislabeling, or incorrect annotation, has been a long-standing problem in biomedical research and contributes to irreproducible results and invalid conclusions. These problems are especially prevalent in multi-omics studies, in which a large set of biological samples is characterized on multiple types of omics platforms at different times or in different labs. While multi-omics studies have demonstrated tremendous value in understanding disease biology and improving patient outcomes, the complexity of these studies may increase opportunities for human error. Fortunately, the interrelated nature of the data collected in multi-omics studies can be exploited to facilitate the identification and, in some cases, correction of mislabeling errors. This dissertation proposed a pipeline comprising statistical and machine learning techniques to identify mislabeled samples and correct their labels. Expected correlations among copy number variation, gene transcript abundance, protein abundance, and microRNA expression were used to identify mislabeled samples. In datasets with only two omics data types, label correction was performed by exploiting gender-specific indicators of the mislabeled samples, whereas in datasets with more than two omics data types, a network topology realignment method was proposed to perform label correction. We demonstrated the effectiveness of the pipeline on several cancer datasets through simulation experiments. The pipeline was then applied to several public multi-omics datasets; overall, 2.71% of the samples were found to be mislabeled.
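
The idea of using expected cross-omics correlations to spot mislabeled samples can be illustrated with a minimal sketch. This is not the dissertation's pipeline: the function name `flag_mismatched_samples`, the z-score similarity measure, the best-match rule, and the simulated data are all assumptions chosen for illustration. The sketch assumes two omics matrices whose rows are samples in the same nominal order and whose columns are matched genes.

```python
import numpy as np

def flag_mismatched_samples(omics_a, omics_b):
    """Flag samples whose best cross-omics match is a different sample.

    Hypothetical helper for illustration only. Both inputs are
    (samples x genes) arrays with rows in the same nominal sample order
    and columns referring to the same genes.
    """
    # Standardize each gene across samples so that genes are comparable.
    za = (omics_a - omics_a.mean(axis=0)) / omics_a.std(axis=0)
    zb = (omics_b - omics_b.mean(axis=0)) / omics_b.std(axis=0)
    # similarity[i, j]: mean product of z-scores between sample i on
    # platform A and sample j on platform B.
    similarity = za @ zb.T / za.shape[1]
    # For each platform-A sample, find the most similar platform-B sample.
    best_match = similarity.argmax(axis=1)
    # A sample is suspicious when its best match is not itself.
    flagged = np.flatnonzero(best_match != np.arange(len(best_match)))
    return flagged, best_match

# Simulated data (an assumption, not real omics measurements):
# a shared biological signal plus independent platform noise.
rng = np.random.default_rng(0)
n_samples, n_genes = 20, 200
signal = rng.normal(size=(n_samples, n_genes))
rna = signal + 0.3 * rng.normal(size=(n_samples, n_genes))
protein = signal + 0.3 * rng.normal(size=(n_samples, n_genes))

# Simulate a labeling error by swapping samples 3 and 7 on one platform.
protein[[3, 7]] = protein[[7, 3]]

flagged, best_match = flag_mismatched_samples(rna, protein)
print("flagged samples:", flagged.tolist())
print("suggested matches:", {int(i): int(best_match[i]) for i in flagged})
```

In this simulation the two swapped samples are flagged, and each one's best match points to the other, which also suggests how the labels could be corrected. Real data would need handling for weakly correlated genes, missing values, and constant features, none of which this sketch addresses.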

Page Count

109

Department or Program

Department of Computer Science and Engineering

Year Degree Awarded

2021

ORCID ID

0000-0002-1210-9316

