Computational Analysis of Metabolomic Toxicological Data Derived from NMR Spectroscopy

Nuclear magnetic resonance (NMR) spectroscopy is a non-invasive method of acquiring metabolic profiles from biofluids. The most informative metabolomic features, or biomarkers, may provide keys to the early detection of changes within an organism, such as those that result from exposure to a toxin. One major difficulty with typical NMR data, whether it comes from a toxicological, medical, or other source, is that it has a small sample size relative to the number of variables measured. Thus, traditional pattern recognition techniques are not always feasible. The "curse of dimensionality" is an important consideration in selecting appropriate statistical and pattern recognition methods for the identification of potential biomarkers.

In this thesis, several alternatives for isolating biomarkers are evaluated on an NMR-derived toxicological data set and their results are compared: the fold test, univariate ranking, the unpaired t-test, and the paired t-test. Potential biomarkers are inspected for differences based on several subjective criteria, including the ability to identify consistent differences between treatment and control samples and the ability to distinguish potential vehicle effects, that is, effects caused by the method of delivery, which is performed on both treated and control animals.
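As a rough illustration of two of the per-feature statistics named above, the following Python sketch computes a fold change and a paired t statistic for a single spectral feature. The thesis abstract does not define the fold test or the pairing scheme, so both are assumptions here: the fold change is taken as the ratio of mean intensities, and pairing is assumed to be the same animal measured before and after dosing. The function names and the intensity values are hypothetical.

```python
import math
import statistics


def fold_change(treated, control):
    """Assumed definition: ratio of mean treated to mean control intensity."""
    return statistics.mean(treated) / statistics.mean(control)


def paired_t_statistic(before, after):
    """Paired t statistic: mean within-subject difference over its standard error."""
    if len(before) != len(after):
        raise ValueError("paired samples must have equal length")
    diffs = [a - b for a, b in zip(after, before)]
    n = len(diffs)
    std_err = statistics.stdev(diffs) / math.sqrt(n)
    return statistics.mean(diffs) / std_err


# Hypothetical intensities of one spectral bin in four animals,
# measured before and after dosing (illustrative values only).
pre_dose = [1.0, 1.2, 0.9, 1.1]
post_dose = [1.5, 1.6, 1.5, 1.6]

fc = fold_change(post_dose, pre_dose)
t_stat = paired_t_statistic(pre_dose, post_dose)
print(f"fold change = {fc:.3f}, paired t = {t_stat:.3f} (df = {len(pre_dose) - 1})")
```

Because the paired statistic is built from within-subject differences, baseline variation between animals cancels out, which is one way to read the abstract's point about the consistency of a single subject over a time course.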

Based on these results, the paired t-test method is preferred because it attributes statistical significance, takes into account the consistency of a single subject over a time course, and mitigates the problem of small sample size and high dimensionality. A protocol for the paired t-test is also proposed to remove potential vehicle effects and identify toxic responses above the vehicle effects. Because of the large number of variables to be considered, a correction for multiple testing must be employed. In this thesis, several methods of correction for multiple testing are evaluated. An acceptable p-value cutoff for each correction is proposed so that the most appropriate correction can be applied based on the purpose of the metabolomic toxicology experiment.
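To make the need for a multiple-testing correction concrete, the sketch below adjusts a small list of p-values in two ways. The abstract does not name the corrections that were evaluated, so Bonferroni and Benjamini-Hochberg are used here only as two familiar examples, not as the methods of the thesis. The p-values and the 0.05 cutoff are hypothetical.

```python
def bonferroni(p_values):
    """Bonferroni adjustment: multiply each p-value by the number of tests."""
    m = len(p_values)
    return [min(1.0, p * m) for p in p_values]


def benjamini_hochberg(p_values):
    """Benjamini-Hochberg adjusted p-values (controls the false discovery rate)."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value to the smallest so that the adjusted
    # values stay monotone in the original ordering of the p-values.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Hypothetical per-feature p-values from variable-at-a-time t-tests.
p_values = [0.001, 0.008, 0.039, 0.041, 0.2]
cutoff = 0.05  # illustrative threshold, not the cutoff proposed in the thesis

bonf = bonferroni(p_values)
bh = benjamini_hochberg(p_values)
print("Bonferroni:", [round(p, 5) for p in bonf])
print("Benjamini-Hochberg:", [round(p, 5) for p in bh])
print("kept (Bonferroni):", sum(p <= cutoff for p in bonf))
print("kept (BH):", sum(p <= cutoff for p in bh))
```

Bonferroni controls the chance of any false positive and is the stricter of the two, while Benjamini-Hochberg controls the expected proportion of false positives among the selected features. Which trade-off is appropriate depends on the purpose of the experiment, as the paragraph above notes.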

Also in this thesis, a more complex method for identifying biomarkers, Orthogonal Projection to Latent Structures Discriminant Analysis (O-PLS-DA), is compared to the t-test using synthetic data sets based on the characterization of experimental NMR spectra. The rankings of potential biomarkers produced by the two methods are compared to the ranking of features used to create the synthetic data. In addition, an O-PLS-DA permutation test method of determining an important-feature cutoff is evaluated using the synthetic data. The variable-at-a-time t-test method using a p-value threshold is also evaluated for comparison. Based on these results, the O-PLS-DA permutation test is not consistent or stable enough to distinguish truly responding biomarkers.

The benefits of O-PLS-DA, including its ability to deal with correlated variables, its removal of unwanted systematic variation, and its ability to deal with some amount of missing data, make it sufficient for identifying potential biomarkers. It is determined that O-PLS-DA does not rank potential biomarkers differently than the t-test, nor does it classify new samples significantly better or worse than a majority-vote-based t-test classifier.