Publication Date

2019

Document Type

Dissertation

Committee Members

Krishnaprasad Thirunarayan, Ph.D. (Advisor); Keke Chen, Ph.D. (Committee Member); Guozhu Dong, Ph.D. (Committee Member); Steven Gustafson, Ph.D. (Committee Member); Srinivasan Parthasarathy, Ph.D. (Committee Member); Valerie L. Shalin, Ph.D. (Committee Member)

Degree Name

Doctor of Philosophy (PhD)

Abstract

Information Extraction (IE) techniques are developed to extract entities, relationships, and other detailed information from unstructured text. Most methods in the literature focus on supervised machine learning techniques, which are not very practical because of the high cost of obtaining annotations and the difficulty of creating a high-quality gold standard (in terms of reliability and coverage). Semi-supervised and distantly supervised techniques have therefore been gaining traction as a way to overcome some of these challenges, for example by bootstrapping the learning quickly. This dissertation focuses on information extraction, and in particular on the extraction of entities, i.e., Named Entity Recognition (NER), from multiple domains, including social media and grammatical texts such as news and medical documents. This work explores ways to lower the cost of building NER pipelines with the help of available knowledge, without compromising the quality of extraction, while also taking into consideration feasibility and other concerns such as user experience. I present distantly supervised (dictionary-based), supervised (with reduced cost through entity set expansion and active learning), and minimally supervised NER approaches. In addition, I discuss the various aspects of knowledge-enabled NER approaches and how and why they are a better fit for today's real-world NER pipelines in dealing with, and partially overcoming, the above-mentioned difficulties. I present two dictionary-based NER approaches. The first extracts location mentions from text streams; it proved very effective for stream processing, with competitive performance in comparison with ten other techniques. The second is a generic NER approach that scales to multiple domains and is minimally supervised, with a human-in-the-loop providing online feedback.
Both techniques augment and filter the dictionaries, to compensate for their incompleteness (due to lexical variation between dictionary records and mentions in the text) and to eliminate the noise and spurious content in them. The third technique I present is a supervised approach, but with a reduced cost in terms of the number of labeled samples and the complexity of annotation. The cost reduction was achieved with the help of a human-in-the-loop and smart instance samplers implemented using entity set expansion and active learning. The use of knowledge, the monitoring of the NER models' accuracy, and the full exploitation of inputs from the human-in-the-loop were key to overcoming the practical and technical challenges. I make the data and code for the approaches presented in this dissertation publicly available.

Page Count

144

Department or Program

Department of Computer Science and Engineering

Year Degree Awarded

2019

Creative Commons License

Creative Commons Attribution-Noncommercial-No Derivative Works 3.0 License

