Background: Literature-based discovery (LBD) is characterized by uncovering hidden associations in non-interacting scientific literature. Prior approaches to LBD include the use of: 1) domain expertise and structured background knowledge to manually filter and explore the literature, 2) distributional statistics and graph-theoretic measures to rank interesting connections, and 3) heuristics to help eliminate spurious connections. However, manual approaches to LBD are not scalable, and purely distributional approaches may not be sufficient to obtain insights into the meaning of poorly understood associations. While several graph-based approaches have the potential to elucidate associations, their effectiveness has not been fully demonstrated. A considerable degree of a priori knowledge, heuristics and manual filtering is still required.

Objectives: In this paper we implement and evaluate a context-driven, automatic subgraph creation method that captures multifaceted complex associations between biomedical concepts for LBD. Given a pair of concepts, our method automatically generates a ranked list of subgraphs, which provide informative and potentially unknown associations between such concepts.

Methods: To generate subgraphs, the set of all MEDLINE articles that contain either of two specified concepts (A, C) is first collected. Binary relationships or assertions automatically extracted from these articles, called semantic predications, are then used to create a labeled, directed predications graph. In this graph, a path is represented as a sequence of semantic predications. The hierarchical agglomerative clustering (HAC) algorithm is then applied to cluster the paths bounded by the two concepts (A, C), where the context of a path is defined as a set of Medical Subject Headings (MeSH) descriptors. Paths whose semantic relatedness exceeds a threshold are clustered into subgraphs based on their shared context. The automatically generated clusters are then provided as a ranked list of subgraphs.
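The pipeline described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the toy predications, the function names, the use of Jaccard overlap between MeSH descriptor sets as the relatedness measure, the average-linkage merge rule, the 0.5 threshold and the ranking by cluster size are all assumptions made for illustration only.

```python
from itertools import combinations

# Hypothetical toy predications: (subject, predicate, object, MeSH descriptors
# of the source article). Real input would be extracted from MEDLINE.
PREDICATIONS = [
    ("A", "INHIBITS", "B1", {"Blood Viscosity", "Platelet Aggregation"}),
    ("B1", "CAUSES", "C", {"Blood Viscosity", "Vascular Diseases"}),
    ("A", "AFFECTS", "B2", {"Platelet Aggregation", "Blood Viscosity"}),
    ("B2", "ASSOCIATED_WITH", "C", {"Vascular Diseases", "Platelet Aggregation"}),
    ("A", "STIMULATES", "B3", {"Gene Expression"}),
    ("B3", "AFFECTS", "C", {"Gene Expression", "Transcription Factors"}),
]


def find_paths(predications, source, target, max_len=3):
    """Enumerate simple directed paths from source to target.

    Each path is a list of predications (a sequence of labeled edges).
    """
    outgoing = {}
    for pred in predications:
        outgoing.setdefault(pred[0], []).append(pred)

    paths = []

    def walk(node, path, visited):
        if node == target and path:
            paths.append(list(path))
            return
        if len(path) == max_len:
            return
        for pred in outgoing.get(node, []):
            if pred[2] not in visited:
                path.append(pred)
                walk(pred[2], path, visited | {pred[2]})
                path.pop()

    walk(source, [], {source})
    return paths


def path_context(path):
    """Context of a path: union of the MeSH descriptors of its predications."""
    context = set()
    for pred in path:
        context |= pred[3]
    return frozenset(context)


def jaccard(a, b):
    """Set overlap, used here as a stand-in for semantic relatedness."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def cluster_paths(paths, threshold=0.5):
    """Average-linkage agglomerative clustering of paths by shared context.

    Repeatedly merges the most similar pair of clusters and stops when no
    pair reaches the threshold. Clusters are returned largest first
    (an assumed ranking criterion).
    """
    clusters = [[p] for p in paths]

    def similarity(c1, c2):
        scores = [jaccard(path_context(x), path_context(y)) for x in c1 for y in c2]
        return sum(scores) / len(scores)

    while len(clusters) > 1:
        (i, j), best = max(
            (((i, j), similarity(clusters[i], clusters[j]))
             for i, j in combinations(range(len(clusters)), 2)),
            key=lambda item: item[1],
        )
        if best < threshold:
            break
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]  # safe: combinations guarantees i < j
    return sorted(clusters, key=len, reverse=True)


if __name__ == "__main__":
    for rank, subgraph in enumerate(cluster_paths(find_paths(PREDICATIONS, "A", "C")), 1):
        intermediates = sorted({pred[2] for path in subgraph for pred in path[:-1]})
        print(rank, intermediates)
```

In this toy example, the two paths through B1 and B2 share the same MeSH context and fall into one subgraph, while the path through B3 forms its own. A production system would likely replace the Jaccard overlap with a measure that accounts for the MeSH hierarchy.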

Results: The subgraphs generated using this approach facilitated the rediscovery of 8 out of 9 existing scientific discoveries. In particular, they directly (or indirectly) led to the recovery of several intermediates (or B-concepts) between A and C, while also providing insights into the meaning of each association. Such meaning is derived from predicates between the concepts, as well as the provenance of the semantic predications in MEDLINE. Additionally, by generating subgraphs on different thematic dimensions (such as Cellular Activity, Pharmaceutical Treatment and Tissue Function), the approach enables a broader understanding of the nature of complex associations between concepts in a domain. In a statistical evaluation to determine the interestingness of the subgraphs, it was observed that an arbitrary association is mentioned in only approximately 4 articles in MEDLINE on average.

Conclusion: These results suggest that the implicit and explicit context provided by manually assigned MeSH descriptors is an effective representation for capturing the underlying semantics of complex associations along multiple thematic dimensions for LBD.


Author's accepted manuscript will be available for download on April 1, 2016. The final, publisher's version is available via

Obvio demo video can be found at