Publication Date
2015
Document Type
Thesis
Committee Members
Tanvi Banerjee (Committee Member), Amit Sheth (Advisor), Krishnaprasad Thirunarayan (Committee Member)
Degree Name
Master of Science (MS)
Abstract
With the advent of web search and microblogging, the percentage of Online Health Information Seekers (OHIS) using these services to share and seek health information in real time has increased exponentially. Recently, Twitter has emerged as one of the primary media for sharing and seeking the latest information on a variety of topics, including health. Although Twitter is an excellent information source, identifying useful information in the deluge of tweets is a major challenge. Twitter search is limited to keyword-based techniques for retrieving information for a given query, and the results sometimes do not contain up-to-date (real-time) information. Moreover, Twitter does not utilize semantics to retrieve results. To address these challenges, we developed a system termed Social Health Signals, which leverages rich domain knowledge to extract relevant and reliable health information from Twitter in near real time. We have used semantics-based techniques to 1) retrieve relevant and reliable health information shared on Twitter in real time, 2) enable question answering, 3) rank results based on relevancy, popularity, and reliability, and 4) enable efficient browsing of the results by semantically grouping the search results into health categories. In our approach, we search Twitter using several distinctive features, including triple-pattern-based mining, near real-time retrieval, and search based on the URLs contained in tweets. First, the triple-pattern (subject, predicate, and object) mining technique extracts triple patterns from microblog messages related to chronic health conditions. The triple pattern is defined by a natural-language question supplied by the user. Second, in order to make the system near real-time, the search results are divided into intervals of six hours. Third, in addition to tweets, we use the content of the URLs mentioned in the tweets as a data source. 
Finally, the results are ranked according to relevancy and popularity so that, at any given time, the most relevant information for the question is displayed, instead of basing results solely on temporal relevance. Our evaluation focuses on questions related to diabetes, such as "How to control diabetes?", and compares the results with those of Twitter search. To compare our results against Twitter's, we selected reliability, relevancy, and real-time features for the evaluation. To check the relevance of the results, we conducted a blind survey using three selected questions dealing with diabetes. To evaluate source reliability, we compared the Google domain PageRank of our top 10 results with that of Twitter's top 10 results. Finally, to assess the real-time aspect, we compared the timestamps of the Twitter search results with those of our system's search results.
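As an illustration of the six-hour interval and ranking steps described above, the following is a minimal sketch only, not the thesis implementation. The `Result` record, the `relevancy` and `popularity` scores in [0, 1], the equal weights, and the weighted-sum combination are all assumptions introduced here; the abstract does not specify how the scores are computed or combined.

```python
from dataclasses import dataclass
from datetime import datetime, timezone

WINDOW_SECONDS = 6 * 60 * 60  # six-hour interval, as stated in the abstract


@dataclass
class Result:
    """Hypothetical search result; field names are illustrative assumptions."""
    text: str
    posted_at: datetime  # timezone-aware timestamp
    relevancy: float     # assumed score in [0, 1]
    popularity: float    # assumed score in [0, 1]


def window_index(ts: datetime) -> int:
    """Map a timestamp to the index of its six-hour window since the Unix epoch."""
    return int(ts.timestamp()) // WINDOW_SECONDS


def rank_latest_window(results, w_rel=0.5, w_pop=0.5):
    """Return the results in the most recent window, best combined score first.

    The weighted sum is an assumed scoring rule chosen for illustration.
    """
    if not results:
        return []
    latest = max(window_index(r.posted_at) for r in results)
    in_window = [r for r in results if window_index(r.posted_at) == latest]
    return sorted(
        in_window,
        key=lambda r: w_rel * r.relevancy + w_pop * r.popularity,
        reverse=True,
    )


# Usage with made-up data: only results from the latest six-hour window are ranked.
sample = [
    Result("a", datetime(2015, 1, 1, 0, 30, tzinfo=timezone.utc), 0.9, 0.9),
    Result("b", datetime(2015, 1, 1, 6, 10, tzinfo=timezone.utc), 0.2, 0.2),
    Result("c", datetime(2015, 1, 1, 7, 0, tzinfo=timezone.utc), 0.9, 0.5),
]
ranked = rank_latest_window(sample)
```

Restricting the ranking to the latest window keeps results fresh, while ordering within that window by combined score avoids sorting purely by timestamp.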
Page Count
63
Department or Program
Department of Computer Science
Year Degree Awarded
2015
Copyright
Copyright 2015, all rights reserved. This open access ETD is published by Wright State University and OhioLINK.