Publication Date
2021
Document Type
Dissertation
Committee Members
Michael L. Raymer, Ph.D. (Advisor); David R. Cool, Ph.D. (Committee Member); Lynn K. Hartzler, Ph.D. (Committee Member); Travis E. Doom, Ph.D. (Committee Member); Courtney E.W. Sulentic, Ph.D. (Committee Member)
Degree Name
Doctor of Philosophy (PhD)
Abstract
Computational models may assist in the identification and prioritization of chemicals within large chemical libraries. Recent experimental and data curation efforts, such as those of the Tox21 consortium, have produced toxicological datasets covering increasing numbers of chemicals and toxicity endpoints, creating a golden opportunity for the exploration of multi-label learning and deep learning approaches in this thesis. Multi-label classification (MLC) methods may improve model predictivity by accounting for label dependence. However, current measures of label dependence, such as the correlation coefficient, are inappropriate for datasets with extreme class imbalance, which is often seen in toxicological datasets. In this thesis, we propose a novel label dependence measure that directly models the conditional probability of a label pair and displays greater sensitivity than the correlation coefficient for labels with low prior probabilities. MLC models using data-driven label partitioning based on this measure were generally non-inferior to MLC models using random label partitioning. Marginal improvements in model predictivity have prompted toxicology modelers to shy away from deep learning and resort to ‘simpler’ models, such as k-nearest neighbors, for their greater explainability. Given the prevalence of local, linear quantitative structure-activity relationship (QSAR) models in computational toxicology, we hypothesize that toxicological datasets have locally linear data structures, resulting in heterogeneous classification spaces that challenge the basic assumptions of most machine learning algorithms. We propose the locality-sensitive deep learner, a modification of deep neural networks that uses an attention mechanism to learn datapoint locality.
On carefully constructed synthetic data with extremely unbalanced classes (10% active) and 60% cluster-specific noise, the locality-sensitive deep learner with learned feature weights retained high test performance (AUC > 0.9), while the feed-forward neural network appeared to overfit the data (AUC < 0.6). On the Tox21 dataset [1], the locality-sensitive deep learner outperformed the feed-forward neural network on 9 of 12 labels. For the acetylcholinesterase inhibition (AChEi) [2], Collaborative Modeling Project for Androgen Receptor Activity (CoMPARA) [3], and Acute Oral Toxicity (AOT) [4] datasets, combining the locality-sensitive deep learner with the feed-forward neural network increased model performance relative to the individual models in most cases. Generalizing machine learning models to fit locally linear data or to leverage label dependence may improve model predictivity.
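The abstract's point about correlation and rare labels can be illustrated with a small numerical sketch. This is not the dissertation's measure, whose exact definition is not given in this record; it only contrasts the Pearson correlation of two binary labels (the phi coefficient) with a plain empirical conditional probability. The function names and the toy label counts are assumptions made up for illustration.

```python
from math import sqrt


def phi_coefficient(a, b):
    """Pearson correlation between two equal-length binary (0/1) label vectors."""
    n = len(a)
    n11 = sum(1 for x, y in zip(a, b) if x == 1 and y == 1)  # both labels active
    n_a = sum(a)  # samples where label A is active
    n_b = sum(b)  # samples where label B is active
    denom = sqrt(n_a * (n - n_a) * n_b * (n - n_b))
    if denom == 0:
        return 0.0  # correlation is undefined when a label is constant
    return (n * n11 - n_a * n_b) / denom


def conditional_probability(a, b):
    """Empirical P(B = 1 | A = 1); returns None if A is never active."""
    n_a = sum(a)
    if n_a == 0:
        return None
    n11 = sum(1 for x, y in zip(a, b) if x == 1 and y == 1)
    return n11 / n_a


# Hypothetical toy data: 1000 chemicals, label A active in 10, label B in 100.
# 8 of the 10 A-active chemicals are also B-active.
label_a = [1] * 10 + [0] * 990
label_b = [1] * 8 + [0] * 2 + [1] * 92 + [0] * 898

prior_b = sum(label_b) / len(label_b)
print("P(B)        =", prior_b)
print("P(B | A)    =", conditional_probability(label_a, label_b))
print("correlation =", round(phi_coefficient(label_a, label_b), 3))
```

In this toy case, knowing that A is active raises the probability of B from 0.1 to 0.8, yet the correlation coefficient is only about 0.23, because the many chemicals inactive for both labels dominate it. That gap is the kind of insensitivity the abstract attributes to correlation under extreme class imbalance.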
Page Count
126
Department or Program
Biomedical Sciences
Year Degree Awarded
2021
Copyright
Copyright 2021, some rights reserved. My ETD may be copied and distributed only for non-commercial purposes and may not be modified. All use must give me credit as the original author.
Creative Commons License
This work is licensed under a Creative Commons Attribution-Noncommercial-No Derivative Works 3.0 License.
ORCID ID
0000-0002-1092-7956