Document Type

Conference Proceeding

Publication Date

2025

Abstract

Extracting patient subpopulations (clinically relevant cohorts of individuals who share overlapping symptoms, risk factors, or diagnostic criteria) from unstructured medical notes remains challenging because clinical language is highly variable and patient conditions are complex. We demonstrate a pipeline that combines named entity recognition (NER), transformer embeddings, guided dimensionality reduction, and LLM-mediated knowledge graph integration to improve patient subpopulation extraction. The approach begins with NER using the UMLS Metathesaurus [1] to extract clinical terms, which are then transformed into vector embeddings by a biomedical transformer. These embeddings are augmented with structured knowledge graph representations generated through an LLM-driven extraction process and embedded with TransE [2]. To improve the separation of key semantic features, we apply autoencoder-based dimensionality reduction before concatenating term embeddings with their graph-based counterparts. A feedforward neural network with an attention layer classifies the resulting embeddings to determine patient subgroup membership. We evaluate the pipeline on multiple datasets, including the extraction of a subpopulation from Dayton Children's Hospital, with experiments demonstrating improvements over baseline BERT-only and keyword-based methods in classifying medical reports by specialty and behavioral health relevance. Our results show that incorporating knowledge graphs and dimensionality reduction enhances precision and interpretability while maintaining adaptability to different research queries.
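
As a rough illustration of two steps named in the abstract, the sketch below scores a knowledge graph triple with the standard TransE distance, ||h + r - t||, and concatenates a dimensionality-reduced term embedding with a graph embedding. It is not the authors' implementation: the function names, the two-dimensional vectors, and all numeric values are hypothetical, and the autoencoder, the biomedical transformer, and the classifier are not shown.

```python
import math


def transe_distance(head, relation, tail):
    """L2 distance ||h + r - t||; a smaller value means TransE finds the triple more plausible."""
    return math.sqrt(sum((h + r - t) ** 2 for h, r, t in zip(head, relation, tail)))


def concat_features(reduced_term_vec, graph_vec):
    """Join a dimensionality-reduced term embedding with its graph embedding."""
    return list(reduced_term_vec) + list(graph_vec)


# Hypothetical toy embeddings, chosen only to make the arithmetic easy to follow.
head = [1.0, 0.0]
relation = [0.0, 1.0]
tail_plausible = [1.0, 1.0]    # equals head + relation
tail_implausible = [4.0, 5.0]  # far from head + relation

d_plausible = transe_distance(head, relation, tail_plausible)
d_implausible = transe_distance(head, relation, tail_implausible)

# Assumed reduced term embedding (stand-in for an autoencoder bottleneck output).
features = concat_features([0.2, 0.7], head)
```

In this toy setup the plausible triple has a smaller distance than the implausible one, and the concatenated feature vector is what a downstream classifier would receive.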

Comments

This work is licensed under a Creative Commons Attribution 4.0 License (CC BY 4.0).

