Publication Date

2020

Document Type

Thesis

Committee Members

Michael Raymer, Ph.D. (Advisor); Mateen Rizki, Ph.D. (Committee Member); Krishnaprasad Thirunarayan, Ph.D. (Committee Member)

Degree Name

Master of Science (MS)

Abstract

Sentence embeddings are frequently generated using complex, pretrained models that were trained on a very general corpus of data. This thesis explores a potential alternative method for efficiently generating high-quality sentence embeddings for highly specialized corpora. A framework for visualizing and analyzing sentence embeddings is developed to help assess the quality of sentence embeddings for a highly specialized corpus of documents related to the 2019 coronavirus epidemic. A Topological Data Analysis (TDA) technique is explored as an alternative method for grouping embeddings in document clustering and topic modeling tasks, and its effectiveness is compared to that of a simple clustering method. The generated sentence embeddings are found to be effective in similarity-based tasks and to group in useful ways when used with the TDA-based techniques explored as alternatives to traditional clustering-based approaches.

Page Count

104

Department or Program

Department of Computer Science and Engineering

Year Degree Awarded

2020

Included in

Computer Engineering Commons, Computer Sciences Commons

Topological Analysis of Averaged Sentence Embeddings
