Publication Date

2021

Document Type

Dissertation

Committee Members

Michael L. Raymer, Ph.D. (Advisor); David R. Cool, Ph.D. (Committee Member); Lynn K. Hartzler, Ph.D. (Committee Member); Travis E. Doom, Ph.D. (Committee Member); Courtney E.W. Sulentic, Ph.D. (Committee Member)

Degree Name

Doctor of Philosophy (PhD)

Abstract

Computational models may assist in the identification and prioritization of chemicals in large chemical libraries. Recent experimental and data curation efforts, such as those from the Tox21 consortium, have produced toxicological datasets covering increasing numbers of chemicals and toxicity endpoints, creating a golden opportunity for this thesis to explore multi-label learning and deep learning approaches. Multi-label classification (MLC) methods may improve model predictivity by accounting for label dependence. However, current measures of label dependence, such as the correlation coefficient, are inappropriate for datasets with the extreme class imbalance often seen in toxicological datasets. In this thesis, we propose a novel label dependence measure that directly models the conditional probability of a label-pair and displays greater sensitivity than the correlation coefficient for labels with low prior probabilities. MLC models using data-driven label partitioning based on this measure were generally non-inferior to MLC models using random label partitioning.

Marginal improvements in model predictivity have prompted toxicology modelers to shy away from deep learning and resort to ‘simpler’ models, such as k-nearest neighbors, for their greater explainability. Given the prevalence of local, linear quantitative structure-activity relationship (QSAR) models in computational toxicology, we hypothesize that toxicological datasets have locally linear data structures, resulting in heterogeneous classification spaces that challenge the basic assumptions of most machine learning algorithms. We propose the locality-sensitive deep learner, a modification of deep neural networks that uses an attention mechanism to learn datapoint locality. On carefully constructed synthetic data with extremely unbalanced classes (10% active) and 60% cluster-specific noise, the locality-sensitive deep learner with learned feature weights retained high test performance (AUC > 0.9), while the feed-forward neural network appeared to overfit the data (AUC < 0.6). For the Tox21 dataset [1], the locality-sensitive deep learner outperformed the feed-forward neural network on 9 out of 12 labels. For the acetylcholinesterase inhibition (AChEi) [2], Collaborative Modeling Project for Androgen Receptor Activity (CoMPARA) [3], and Acute Oral Toxicity (AOT) [4] datasets, combining the locality-sensitive deep learner with the feed-forward neural network increased model performance over the individual models in most cases. Generalizing machine learning models to fit locally linear data or to leverage label dependence may improve model predictivity.
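As a concrete illustration of the label dependence idea above: the thesis's exact estimator is not reproduced on this page, so the sketch below assumes a simple conditional co-activation estimate, P(y_j = 1 | y_i = 1), and contrasts it with the Pearson correlation coefficient on a synthetic pair of imbalanced binary labels (the priors and dependence strength are made up for illustration).

```python
# Illustrative sketch only: a simple conditional-probability estimate of
# label dependence versus Pearson correlation for rare binary labels.
import numpy as np


def conditional_dependence(y_i, y_j):
    """Estimate P(y_j = 1 | y_i = 1) from two binary label vectors."""
    active = y_i == 1
    if active.sum() == 0:
        return 0.0
    return float((y_j[active] == 1).mean())


def pearson_dependence(y_i, y_j):
    """Pearson correlation coefficient between two binary label vectors."""
    return float(np.corrcoef(y_i, y_j)[0, 1])


rng = np.random.default_rng(0)
n = 10_000
y_i = (rng.random(n) < 0.01).astype(int)  # rare label, ~1% prior
# y_j is active ~80% of the time when y_i is active, ~10% otherwise.
y_j = np.where(y_i == 1, rng.random(n) < 0.8, rng.random(n) < 0.1).astype(int)

print(conditional_dependence(y_i, y_j))  # ~0.8: strong dependence is directly visible
print(pearson_dependence(y_i, y_j))      # considerably smaller, deflated by y_i's rare prior
```

Similarly, the locality-sensitive deep learner is described here only at a high level (a feed-forward network modified with an attention mechanism that learns datapoint locality). The PyTorch sketch below is an assumed illustration of that idea, using hypothetical learned anchor points whose attention weights produce per-datapoint feature weights; it is not the author's exact architecture.

```python
# Illustrative sketch only: attention over learned "locality anchors" is used to
# re-weight input features per datapoint before a standard feed-forward classifier.
import torch
import torch.nn as nn


class LocalitySensitiveNet(nn.Module):
    def __init__(self, n_features, n_anchors=8, hidden=64):
        super().__init__()
        # Hypothetical learned anchor points defining local regions of the input space.
        self.anchors = nn.Parameter(torch.randn(n_anchors, n_features))
        # Per-anchor feature weights, softly combined via attention.
        self.anchor_feature_weights = nn.Parameter(torch.ones(n_anchors, n_features))
        self.classifier = nn.Sequential(
            nn.Linear(n_features, hidden), nn.ReLU(), nn.Linear(hidden, 1)
        )

    def forward(self, x):
        # Attention over anchors from (negative) distance: nearer anchors get more weight.
        dists = torch.cdist(x, self.anchors)                 # (batch, n_anchors)
        attn = torch.softmax(-dists, dim=1)                  # locality weights
        # Blend anchor-specific feature weights and re-weight the input features.
        feat_w = attn @ torch.sigmoid(self.anchor_feature_weights)  # (batch, n_features)
        return self.classifier(x * feat_w).squeeze(-1)       # per-compound logits


# Example usage with hypothetical 1024-bit molecular fingerprints:
model = LocalitySensitiveNet(n_features=1024)
logits = model(torch.rand(32, 1024))  # (32,) logits for a batch of 32 compounds
```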

Page Count

126

Department or Program

Biomedical Sciences

Year Degree Awarded

2021

Creative Commons License

Creative Commons Attribution-Noncommercial-No Derivative Works 3.0 License
This work is licensed under a Creative Commons Attribution-Noncommercial-No Derivative Works 3.0 License.

ORCID ID

0000-0002-1092-7956
