Unsupervised-based Distributed Machine Learning for Efficient Data Clustering and Prediction

Vishnu Vardhan Baligodugula, Wright State University

Abstract

Machine learning (ML) techniques use training data samples to help understand, predict, classify, and make valuable decisions in applications such as medicine, email filtering, speech recognition, agriculture, and computer vision, where it is challenging or infeasible to produce traditional algorithms that accomplish the needed tasks. Unsupervised ML-based approaches have emerged for building groups of data samples, known as data clusters, which drive decisions about these samples and help solve challenges in critical applications. Data clustering is used in multiple fields, including health, finance, social networks, education, and science. Sequential execution of clustering algorithms such as K-Means, Minibatch K-Means, and Fuzzy C-Means takes a long time, especially with many data samples, regardless of how accurate the results are. This thesis proposes parallel and distributed unsupervised ML techniques to improve the execution time of different ML algorithms. The application of different ML techniques on each system and their specific variations is outlined. Various parallelized unsupervised ML models are developed, implemented, and tested to compare the serial methods with the parallelized ones in terms of execution time and accuracy. For that purpose, parallel K-Means, parallel Minibatch K-Means, and parallel Fuzzy C-Means are developed using an MPI model. A distributed time estimation approach is created that uses the AWS cloud computing architecture. The sequential, parallel, and distributed approaches to K-Means, Minibatch K-Means, and Fuzzy C-Means are investigated to enhance the outcome of the developed models. The strengths and weaknesses of the various ML-based algorithms are analyzed.
As a case study, we use a country dataset of the kind organizations rely on to provide financial assistance to nations based on socioeconomic and health factors, and we analyze it with K-Means, Minibatch K-Means, and Fuzzy C-Means in sequential, parallel, and distributed (AWS) settings. We developed serial, parallel, and distributed implementations, the last based on an AWS ML architecture, and compared the execution times of K-Means, Minibatch K-Means, and Fuzzy C-Means to determine the most efficient method. Our results reveal that Minibatch K-Means outperforms the other two clustering methods in both sequential and parallel execution. We also observe that all the developed models perform better in sequential execution than in parallel execution. This work concludes that execution times decrease when these models are implemented on a distributed platform, namely Amazon SageMaker, a cloud computing platform, with no noticeable impact on the accuracy of the developed models.
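To make the idea of a parallelized K-Means concrete, the following is a minimal sketch of the common data-parallel pattern: each worker computes per-cluster sums and counts on its own chunk of the data, and the partial results are then reduced into new centroids. It is an illustration under stated assumptions, not the implementation developed in this thesis. The workers are simulated with an ordinary loop rather than MPI processes, and all names (`nearest`, `partial_stats`, `kmeans_data_parallel`, `n_workers`) are hypothetical.

```python
import random


def nearest(point, centroids):
    """Return the index of the centroid closest to point (squared Euclidean)."""
    best, best_d = 0, float("inf")
    for j, c in enumerate(centroids):
        d = sum((p - q) ** 2 for p, q in zip(point, c))
        if d < best_d:
            best, best_d = j, d
    return best


def partial_stats(chunk, centroids):
    """Work of one simulated worker: per-cluster coordinate sums and counts."""
    k, dim = len(centroids), len(centroids[0])
    sums = [[0.0] * dim for _ in range(k)]
    counts = [0] * k
    for point in chunk:
        j = nearest(point, centroids)
        counts[j] += 1
        for d in range(dim):
            sums[j][d] += point[d]
    return sums, counts


def kmeans_data_parallel(data, k, n_workers=4, n_iter=20, seed=0):
    """Lloyd's K-Means with the assignment step split across data chunks."""
    rng = random.Random(seed)
    centroids = [list(p) for p in rng.sample(data, k)]
    dim = len(centroids[0])
    # Assumption: a simple strided split; a real MPI program would scatter data.
    chunks = [data[i::n_workers] for i in range(n_workers)]
    for _ in range(n_iter):
        # In an MPI setting each chunk would be handled by a separate process.
        results = [partial_stats(chunk, centroids) for chunk in chunks]
        # Reduction step: combine partial sums and counts (an all-reduce in MPI).
        for j in range(k):
            count = sum(counts[j] for _, counts in results)
            if count == 0:
                continue  # keep the previous centroid for an empty cluster
            centroids[j] = [
                sum(sums[j][d] for sums, _ in results) / count
                for d in range(dim)
            ]
    return centroids


if __name__ == "__main__":
    # Two small, well-separated groups of 2-D points (made-up toy data).
    toy = [(0, 0), (0, 1), (1, 0), (1, 1),
           (10, 10), (10, 11), (11, 10), (11, 11)]
    print(sorted(kmeans_data_parallel(toy, k=2)))
```

Because only the per-cluster sums and counts cross worker boundaries, the communication cost per iteration depends on the number of clusters and features rather than on the number of samples. On small datasets that fixed overhead can still outweigh the savings, which is one plausible reason a parallel run may be slower than a sequential one.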