A Context-Driven Subgraph Model for Literature-Based Discovery

Literature-Based Discovery (LBD) refers to the process of uncovering hidden connections that are implicit in scientific literature. Numerous hypotheses have been generated from scientific literature using the LBD paradigm, influencing innovations in diagnosis, treatment, prevention and overall public health. However, much of the existing research on discovering hidden connections among concepts has used distributional statistics and graph-theoretic measures to capture implicit associations. Such metrics do not explicitly capture the semantics of hidden connections. Rather, they only allude to the existence of meaningful underlying associations. To gain in-depth insights into the meaning of hidden (and other) connections, complementary methods have often been employed. These methods include: 1) using domain expertise for concept filtering and knowledge exploration, 2) leveraging structured background knowledge to provide context and supplement concept filtering and 3) developing heuristics a priori to help eliminate spurious connections. While effective in some situations, the practice of relying on domain expertise, structured background knowledge and heuristics to complement distributional and graph-theoretic approaches has serious limitations. The main issue is that the intricate context of complex associations is not always known a priori and cannot easily be computed without understanding the underlying semantics of the associations. Complex associations should not be overlooked, since they are often needed to elucidate the mechanisms of interaction and the causal relationships among concepts. Moreover, they can capture the broader aspects of a biomedical sub-domain by segregating associations along different thematic dimensions, such as Metabolic Function, Pharmaceutical Treatment and Neurological Activity.
This dissertation proposes an innovative context-driven, automatic subgraph creation method for finding hidden and complex associations among concepts along multiple thematic dimensions. It outlines definitions for context and shared context, based on implicit and explicit (or formal) semantics, which compensate for deficiencies in statistical and graph-based metrics. It also eliminates the need to develop heuristics a priori. An evidence-based evaluation of the proposed framework showed that 8 out of 9 existing scientific discoveries could be recovered using this approach. Additionally, insights into the meaning of associations could be obtained using provenance provided by the system. In a statistical evaluation to determine the interestingness of the generated subgraphs, it was observed that an arbitrary association is mentioned in only approximately 4 articles in MEDLINE, on average. These results suggest that leveraging implicit and explicit context, as defined in this dissertation, advances the state of the art in LBD research.


Cameron's PhD Dissertation Defense, given on August 18, 2014.

Video of the defense can be found at https://www.youtube.com/watch?v=3zuCYjSV0b8.