# Automated domain adaptation for clinical natural language processing

> **NIH R01** · BOSTON CHILDREN'S HOSPITAL · 2020 · $383,874

## Abstract

Automatic extraction of useful information from clinical texts enables new clinical research tasks and new technologies at the point of care. The natural language processing (NLP) systems that perform this extraction rely on supervised machine learning. The learning process uses manually labeled datasets that are limited in size and scope, and as a result, applying NLP systems to unseen datasets often results in severely degraded performance. Obtaining larger and broader datasets is unlikely because of the expense of the manual labeling process and the difficulty of sharing text data between institutions. Therefore, this project develops unsupervised domain adaptation algorithms to adapt NLP systems to new data.

Domain adaptation describes the process of adapting a machine learning system to new data sources. The proposed methods are unsupervised in that they do not require manual labels for the new data.

This project has three aims. The first aim makes use of multiple existing datasets for the same task to study the differences in domains, and uses this information to develop new domain adaptation algorithms. Evaluation uses standard machine learning metrics, and analysis of performance is tightly bounded below by strong baselines and above by realistic upper bounds, both based on theoretical research on machine learning generalization. The second aim develops open-source software tools to simplify the process of incorporating domain adaptation into clinical text processing workflows. This software will have input interfaces to connect to methods developed in Aim 1 and output interfaces to connect with Apache cTAKES, a widely used open-source NLP tool. The third aim tests these methods in an end-to-end use case, adverse drug event (ADE) extraction on a dataset of pediatric pulmonary hypertension notes. ADE extraction relies on multiple NLP systems, so this use case can show how broad improvements to NLP methods can improve downstream methods. This aim also creates new manual labels for the dataset for an end-to-end evaluation that directly measures how improvements to the NLP systems lead to improvement in ADE extraction.
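
As an illustration of what "unsupervised" means in the abstract, the toy sketch below adapts a classifier to a shifted target domain using only unlabeled target features. It is not the project's method: the technique shown is plain per-domain mean centering with a nearest-centroid classifier, and the data, function names, and the +6.0 shift are all invented for the example. It also assumes both domains have a similar class balance, which real clinical corpora may not.

```python
from statistics import mean

def center(rows):
    """Subtract each feature's mean, computed from this domain's own unlabeled rows."""
    mus = [mean(col) for col in zip(*rows)]
    return [[x - mu for x, mu in zip(row, mus)] for row in rows]

def fit_centroids(rows, labels):
    """Nearest-centroid classifier: store one mean vector per class."""
    centroids = {}
    for label in set(labels):
        members = [r for r, y in zip(rows, labels) if y == label]
        centroids[label] = [mean(col) for col in zip(*members)]
    return centroids

def predict(centroids, rows):
    """Assign each row to the class whose centroid is closest (squared Euclidean)."""
    def dist2(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return [min(centroids, key=lambda c: dist2(centroids[c], row)) for row in rows]

def accuracy(pred, gold):
    return sum(p == g for p, g in zip(pred, gold)) / len(gold)

# Hypothetical toy data: the target domain is the source shifted by +6.0.
source_x = [[0.0], [1.0], [2.0], [8.0], [9.0], [10.0]]
source_y = [0, 0, 0, 1, 1, 1]
target_x = [[row[0] + 6.0] for row in source_x]
target_y = [0, 0, 0, 1, 1, 1]  # used only to score predictions, never to train

# Baseline: train on source, apply to target unchanged.
baseline = accuracy(predict(fit_centroids(source_x, source_y), target_x), target_y)

# Adapted: center each domain with its own statistics; no target labels needed.
adapted_model = fit_centroids(center(source_x), source_y)
adapted = accuracy(predict(adapted_model, center(target_x)), target_y)

print(f"source-only baseline: {baseline:.2f}, after centering: {adapted:.2f}")
```

On this toy data the source-only model labels every target example as class 1, while the centered model recovers the correct labels. The target labels appear only in the scoring step, which mirrors the unsupervised setting the abstract describes.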

## Key facts

- **NIH application ID:** 9986899
- **Project number:** 5R01LM012918-03
- **Recipient organization:** BOSTON CHILDREN'S HOSPITAL
- **Principal Investigator:** Timothy A Miller
- **Activity code:** R01
- **Funding institute:** NIH
- **Fiscal year:** 2020
- **Award amount:** $383,874
- **Award type:** 5
- **Project period:** 2018-09-01 → 2023-07-31

## Primary source

NIH RePORTER: https://reporter.nih.gov/project-details/9986899

## Citation

> US National Institutes of Health, RePORTER application 9986899, Automated domain adaptation for clinical natural language processing (5R01LM012918-03). Retrieved via AI Analytics 2026-05-23 from https://api.ai-analytics.org/grant/nih/9986899. Licensed CC0.

---

*[NIH grants dataset](/datasets/nih-grants) · CC0 1.0*
