Publication Date
2021
Document Type
Dissertation
Committee Members
Michael Raymer, Ph.D. (Advisor); Michael Markey, Ph.D. (Committee Member); Travis Doom, Ph.D. (Committee Member); Tanvi Banerjee, Ph.D. (Committee Member)
Degree Name
Doctor of Philosophy (PhD)
Abstract
Sample mislabeling, or incorrect annotation, is a long-standing problem in biomedical research that contributes to irreproducible results and invalid conclusions. The problem is especially prevalent in multi-omics studies, in which a large set of biological samples is characterized on multiple omics platforms at different times or in different labs. While multi-omics studies have demonstrated tremendous value in understanding disease biology and improving patient outcomes, their complexity increases the opportunities for human error. Fortunately, the interrelated nature of the data collected in multi-omics studies can be exploited to identify and, in some cases, correct mislabeling errors. This dissertation proposes a pipeline comprising statistical and machine learning techniques to identify mislabeled samples and correct their labels. Expected correlations between copy number variation, gene transcript abundance, protein abundance, and microRNA expression were used to identify mislabeled samples. In datasets with only two omics data types, label correction was performed by exploiting gender-specific indicators of the mislabeled samples, whereas for datasets with more than two omics data types, a network topology realignment method was proposed to perform label correction. The effectiveness of the pipeline was demonstrated in simulation experiments on several cancer datasets. The pipeline was then applied to several public multi-omics datasets; overall, 2.71% of the samples were found to be mislabeled.
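The idea of using expected cross-omics correlations to spot mislabeled samples can be illustrated with a minimal sketch. This is not the dissertation's pipeline: the function name `flag_mismatched_samples`, the simulated `rna` and `protein` matrices, the use of Pearson correlation, and the simple best-match rule are all assumptions made for illustration only.

```python
import numpy as np


def flag_mismatched_samples(omics_a, omics_b):
    """Flag samples whose profile in omics_a best matches a different sample in omics_b.

    Both inputs are (samples x genes) arrays with rows in the same nominal
    sample order and columns restricted to the same genes. (Hypothetical helper.)
    """
    n = omics_a.shape[0]
    # Standardize each gene across samples so high-variance genes do not dominate.
    za = (omics_a - omics_a.mean(axis=0)) / omics_a.std(axis=0)
    zb = (omics_b - omics_b.mean(axis=0)) / omics_b.std(axis=0)
    # sim[i, j] = Pearson correlation between sample i in A and sample j in B.
    sim = np.corrcoef(za, zb)[:n, n:]
    # For each sample in A, find the sample in B it correlates with most strongly.
    best = sim.argmax(axis=1)
    # A sample is suspicious when its best match is not the sample with the same label.
    flagged = np.flatnonzero(best != np.arange(n))
    return flagged, best


# Toy simulation (hypothetical data): a shared biological signal plus platform noise.
rng = np.random.default_rng(0)
n_samples, n_genes = 30, 200
signal = rng.normal(size=(n_samples, n_genes))
rna = signal + 0.3 * rng.normal(size=(n_samples, n_genes))
protein = signal + 0.3 * rng.normal(size=(n_samples, n_genes))

# Simulate a labeling error by swapping two samples in the protein data.
protein[[3, 7]] = protein[[7, 3]]

flagged, best = flag_mismatched_samples(rna, protein)
print(flagged, best[flagged])
```

In this toy setting the two swapped samples are the ones whose best cross-omics match differs from their own label, and the best-match index also suggests which label each should carry. Real data would need matched gene identifiers, handling of missing values, and a significance threshold rather than a bare argmax.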
Page Count
109
Department or Program
Department of Computer Science and Engineering
Year Degree Awarded
2021
Copyright
Copyright 2021, all rights reserved. My ETD will be available under the "Fair Use" terms of copyright law.
ORCID ID
0000-0002-1210-9316