Publication Date
2020
Document Type
Dissertation
Committee Members
Brian Rigling, Ph.D. (Advisor); Fred Garber, Ph.D. (Committee Member); Arnab Shaw, Ph.D. (Committee Member); Joshua Ash, Ph.D. (Committee Member); John Gallagher, Ph.D. (Committee Member)
Degree Name
Doctor of Philosophy (PhD)
Abstract
Clustering algorithms, such as Gaussian mixture models and K-means, often require the number of clusters to be specified a priori. Bayesian nonparametric (BNP) methods avoid this problem by specifying a prior distribution over the cluster assignments that allows the number of clusters to be inferred from the data. This can be especially useful for online clustering tasks, where data arrives in a continuous stream and the number of clusters may change dynamically over time. Classical BNP priors often overestimate the number of clusters, however, leading researchers to develop new priors with more control over this tendency. To date, BNP algorithms resistant to over-clustering have only been implemented for offline processing, utilizing Markov chain Monte Carlo inference. In this dissertation, we derive a novel algorithm for online BNP clustering using variational inference, with explicit control over the over-clustering phenomenon. Additionally, we propose two methods for tuning a critical hyperparameter mid-stream, based on empirical analysis of the BNP cluster assignment prior and a cost function from Gaussian mixture reduction. We demonstrate the effectiveness of our algorithms on dynamic datasets designed specifically to challenge online BNP clustering algorithms. We also show that our algorithms can be employed for practical applications of radar pulse clustering and neural spike sorting, achieving competitive, and often superior, results when compared to classical BNP methods. Furthermore, we exploit the model-based framework to extend our algorithm and tuning methods from purely Gaussian mixtures to handle data of mixed multivariate Gaussian and categorical types, and demonstrate this new extension on real-world data. Our empirical studies indicate that the developments in this dissertation are a significant contribution to the state of the art in BNP clustering.
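As a rough illustration of how a prior over cluster assignments can leave the number of clusters open, the sketch below samples assignments from a Chinese restaurant process (CRP). This is an assumption made for illustration only: the abstract does not name the prior it uses, and the CRP is simply one well-known classical BNP prior. The function name `sample_crp` and the concentration parameter `alpha` are hypothetical and do not come from the dissertation, and the code is not the online variational algorithm described above.

```python
import random


def sample_crp(n_points, alpha, seed=0):
    """Sample cluster assignments from a Chinese restaurant process prior.

    Illustrative only: the CRP stands in for "a classical BNP prior";
    `alpha` (> 0) is the concentration parameter of that assumed prior.
    """
    rng = random.Random(seed)
    counts = []       # counts[k] = number of points assigned to cluster k so far
    assignments = []  # assignments[i] = cluster index of the i-th point
    for _ in range(n_points):
        # An existing cluster k is chosen with weight counts[k];
        # a brand-new cluster is opened with weight alpha.
        weights = counts + [alpha]
        k = rng.choices(range(len(weights)), weights=weights)[0]
        if k == len(counts):
            counts.append(1)   # open a new cluster
        else:
            counts[k] += 1     # join an existing cluster
        assignments.append(k)
    return assignments, counts


if __name__ == "__main__":
    # The number of clusters is not fixed in advance; it emerges from the draw.
    for n in (10, 100, 1000):
        _, counts = sample_crp(n, alpha=1.0)
        print(n, "points ->", len(counts), "clusters")
```

Under this assumed prior, the chance of opening a new cluster never reaches zero, so the cluster count keeps growing slowly as more points arrive. That behavior is one way to picture the over-clustering tendency the abstract attributes to classical BNP priors.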
Page Count
132
Department or Program
Department of Electrical Engineering
Year Degree Awarded
2020
Copyright
Copyright 2020, all rights reserved. My ETD will be available under the "Fair Use" terms of copyright law.