# Learning Universal Patient Representations from Clinical Text with Hierarchical Recurrent Neural Networks

> **NIH R01** · BOSTON CHILDREN'S HOSPITAL · 2021 · $366,696

## Abstract

In this project we develop new methods, based on recurrent neural networks, for extracting important information from electronic health records (EHRs). These methods represent the hierarchical and sequential nature of human language, leverage large-scale datasets to make learning sophisticated representations possible, and make use of novel sources of supervision that are available at this scale.

The model architecture we propose is a hierarchical recurrent neural network (RNN). This architecture explicitly represents temporality at multiple time scales, with stacked RNN layers representing words, sentences, paragraphs, and documents. At the word level, the model is trained to predict important pieces of clinical information, such as negation and temporality, using existing labeled datasets. Training for clinical information extraction at the lowest level ensures that the higher-level models have a foundation of medically relevant inputs. This leaves the challenge of training the higher-level networks, which require massive amounts of labeled training data. We solve this problem by taking advantage of the temporal aspect of information in an EHR: each higher-level recurrent layer is trained with supervision from the future. For example, the document RNN is trained to predict the billing codes and NLP concept codes found in the subsequent document. This source of supervision is scalable, and our preliminary data show that it is effective for learning generalizable patient representations.
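
The hierarchy and the "supervision from the future" idea can be illustrated with a minimal sketch. It is not the project's implementation: it uses plain tanh RNN cells with random, untrained weights, omits the paragraph level for brevity, and runs on invented toy data. All function names, the dimensions, and the placeholder codes (`CODE_A`, etc.) are assumptions made for illustration.

```python
import math
import random


def make_rnn(input_dim, hidden_dim, rng):
    """Random (untrained) weights for a plain tanh RNN cell."""
    scale = 1.0 / math.sqrt(hidden_dim)
    w_in = [[rng.uniform(-scale, scale) for _ in range(input_dim)]
            for _ in range(hidden_dim)]
    w_h = [[rng.uniform(-scale, scale) for _ in range(hidden_dim)]
           for _ in range(hidden_dim)]
    return w_in, w_h


def rnn_states(inputs, params):
    """Run the cell over a sequence of vectors; return every hidden state."""
    w_in, w_h = params
    h = [0.0] * len(w_h)
    states = []
    for x in inputs:
        h = [
            math.tanh(
                sum(w * xi for w, xi in zip(w_in[i], x))
                + sum(w * hi for w, hi in zip(w_h[i], h))
            )
            for i in range(len(w_h))
        ]
        states.append(h)
    return states


def encode_document(doc, word_rnn, sent_rnn):
    """doc is a list of sentences; each sentence is a list of word vectors."""
    # Word-level RNN: final state summarizes each sentence.
    sentence_vecs = [rnn_states(sentence, word_rnn)[-1] for sentence in doc]
    # Sentence-level RNN: final state summarizes the document.
    return rnn_states(sentence_vecs, sent_rnn)[-1]


def future_supervision_pairs(timeline_states, doc_codes):
    """Pair the state after document i with the codes of document i + 1."""
    return [(timeline_states[i], doc_codes[i + 1])
            for i in range(len(timeline_states) - 1)]


rng = random.Random(0)
WORD_DIM, HIDDEN = 3, 4  # toy sizes, chosen arbitrarily
word_rnn = make_rnn(WORD_DIM, HIDDEN, rng)
sent_rnn = make_rnn(HIDDEN, HIDDEN, rng)
doc_rnn = make_rnn(HIDDEN, HIDDEN, rng)


def toy_sentence(n_words):
    """Random stand-in for a sentence of word embeddings."""
    return [[rng.uniform(-1, 1) for _ in range(WORD_DIM)]
            for _ in range(n_words)]


# One hypothetical patient: three documents in chronological order.
patient_docs = [
    [toy_sentence(2), toy_sentence(3)],
    [toy_sentence(4)],
    [toy_sentence(1), toy_sentence(2)],
]
# Placeholder code sets per document (not real billing or concept codes).
patient_codes = [{"CODE_A"}, {"CODE_A", "CODE_B"}, {"CODE_C"}]

doc_vecs = [encode_document(d, word_rnn, sent_rnn) for d in patient_docs]
timeline_states = rnn_states(doc_vecs, doc_rnn)  # document-level RNN
pairs = future_supervision_pairs(timeline_states, patient_codes)
print(len(pairs))  # 2 training pairs from 3 documents
```

In a trained system, each pair would feed a multi-label prediction loss, so the targets come from the record itself rather than from manual annotation. The last document contributes no pair because it has no successor.
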

The patient representations that our model learns are shareable across multiple tasks, potentially streamlining EHR-based research by eliminating what was previously a manual step: designing text-based variables to represent patients. We demonstrate a new workflow for text-based EHR research, showing how the same representations can be used for two completely distinct phenotyping tasks. These phenotyping studies make use of high-quality datasets of patients with pulmonary hypertension (PH) and autism spectrum disorder (ASD) at Boston Children’s Hospital. PH is relatively rare, so finding every patient with a phenotyping algorithm is important for clinical research. ASD has several sub-phenotypes, and finding large numbers of patients from each sub-phenotype can help to better understand the mechanisms of ASD. Along with demonstrating the applicability of our representations to these specific clinical research use cases, we incorporate our patient representations into the i2b2 clinical research software, making them available to all clinical investigators using this platform at Boston Children’s Hospital.
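
The reuse workflow described above, one fixed set of patient representations feeding two unrelated phenotyping tasks, might look roughly like the following sketch. The nearest-centroid classifier is a deliberately simple stand-in chosen for illustration, not the method used in the project, and the patient vectors, patient IDs, and labels are all invented.

```python
def centroid(vectors):
    """Component-wise mean of a non-empty list of equal-length vectors."""
    n = len(vectors)
    return [sum(v[i] for v in vectors) / n for i in range(len(vectors[0]))]


def fit_nearest_centroid(reps, labels):
    """Build one centroid per label from the shared representations."""
    by_label = {}
    for patient_id, label in labels.items():
        by_label.setdefault(label, []).append(reps[patient_id])
    return {label: centroid(vs) for label, vs in by_label.items()}


def predict(model, vector):
    """Return the label whose centroid is closest (squared Euclidean)."""
    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return min(model, key=lambda label: sq_dist(model[label], vector))


# Hypothetical precomputed patient representations (toy 3-d vectors).
patient_reps = {
    "p1": [0.9, 0.1, 0.0],
    "p2": [0.8, 0.2, 0.9],
    "p3": [0.1, 0.9, 0.8],
    "p4": [0.0, 0.8, 0.1],
}

# Two independent label sets over the same patients (invented).
ph_labels = {"p1": "PH", "p2": "PH", "p3": "no PH", "p4": "no PH"}
asd_labels = {"p1": "subtype_a", "p4": "subtype_a",
              "p2": "subtype_b", "p3": "subtype_b"}

# The representations are computed once; only the small task heads differ.
ph_model = fit_nearest_centroid(patient_reps, ph_labels)
asd_model = fit_nearest_centroid(patient_reps, asd_labels)

new_patient = [0.85, 0.15, 0.85]
print(predict(ph_model, new_patient))
print(predict(asd_model, new_patient))
```

The point of the design is that no task-specific text variables are engineered: each new phenotyping study only needs labels and a lightweight model on top of the shared vectors.
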

## Key facts

- **NIH application ID:** 10085674
- **Project number:** 5R01LM012973-03
- **Recipient organization:** BOSTON CHILDREN'S HOSPITAL
- **Principal Investigator:** Timothy A Miller
- **Activity code:** R01
- **Funding institute:** NIH
- **Fiscal year:** 2021
- **Award amount:** $366,696
- **Award type:** 5 (non-competing continuation)
- **Project period:** 2019-02-07 → 2023-09-14

## Primary source

NIH RePORTER: https://reporter.nih.gov/project-details/10085674

## Citation

> US National Institutes of Health, RePORTER application 10085674, Learning Universal Patient Representations from Clinical Text with Hierarchical Recurrent Neural Networks (5R01LM012973-03). Retrieved via AI Analytics 2026-05-23 from https://api.ai-analytics.org/grant/nih/10085674. Licensed CC0.

---

*[NIH grants dataset](/datasets/nih-grants) · CC0 1.0*
