Publication Date


Document Type


Committee Members

Michael Cairelli (Committee Member), Thomas Rindflesch (Committee Member), Amith Sheth (Advisor)

Degree Name

Master of Science (MS)


This thesis presents research on automatically identifying interestingness in a graph of semantic predications. Interestingness represents a subjective quality of information that represents its value in meeting a user's known or unknown retrieval needs. The perception of information as interesting requires a level of utility for the user as well as a balance between significant novelty and sufficient familiarity. It can also be influenced by additional factors such as unexpectedness or serendipity with recent experiences. The ability to identify interesting information facilitates the development of user-centered retrieval, especially in information semantic summarization and iterative, step-wise searching such as in discovery browsing systems. Ultimately, this allows biomedical researchers to more quickly identify information of greatest potential interest to them, whether expected or, perhaps more importantly, unexpected. Current discovery browsing systems use iterative information retrieval to discover new knowledge - a process that requires finding relevant co-occurring topics and relationships through consistent human involvement to identify interesting concepts. Although interestingness is subjective, this thesis identifies computable quantities in semantic data that correlate to interestingness in user searches. We compare several statistical and rule-based models correlating graph data extracted from semantic predications with concept interestingness as demonstrated in PubMed queries. Semantic predications represent scientific assertions extracted from all of the biomedical literature contained in the MEDLINE database. They are of the form, "subject-predicate-object". Predications can easily be represented as graphs, where subjects and objects are nodes and predicates form edges. A graph of predications represents the assertions made in the citations from which the predications were extracted. This thesis uses graph metrics to identify features from the predication graph for model generation. These features are based on degree centrality (connectedness) of the seed concept node and surrounding nodes; they are also based on frequency of occurrence measures of the edges between the seed concept and surrounding nodes as well as between the nodes surrounding the seed concept and the neighbors of those nodes. A PubMed query log is used for training and testing models for interestingness. This log contains a set of user searches over a 24-hour period, and we make the assumption that co-occurrence of concepts with the seed concept in searches demonstrates interestingness of that concept with regard to the seed concept. Graph generation begins by the selection of a set of all predications containing the seed concept from the Semantic Medline database (our training dataset uses Alzheimer's disease as the seed concept). The graph is built with the seed concept as the central node. Additional nodes are added for each concept that occurs with the seed concept in the initial predications and an edge is created for each instance of a predication containing the two concepts. The edges are labeled with the specific predicate in the predication. This graph is extended to include additional nodes within two leaps from the seed concept. The concepts in the PubMed query logs are normalized to UMLS concepts or Entrez Gene symbols using MetaMap. Token-based and user-based counts are collected for each co-occurring term. These measures are combined to create a weighted score which is used to determine three potential thresholds of interestingness based on deviation from the mean score. The concepts that are included in both the graph and the normalized log data are identified for use in model training and testing.

Page Count


Department or Program

Department of Computer Science

Year Degree Awarded


Creative Commons License

Creative Commons Attribution-Noncommercial-Share Alike 3.0 License
This work is licensed under a Creative Commons Attribution-Noncommercial-Share Alike 3.0 License.