# DeconDTN: Deconfounding Deep Transformer Networks for Clinical NLP

> **NIH R01** · UNIVERSITY OF WASHINGTON · 2024 · $348,541

## Abstract

Natural Language Processing (NLP) methods have been broadly applied to clinical problems, from recognition
of clinical findings in physician notes to identification of transcribed speech samples indicating changes in
cognitive status. Deep transformer networks (DTNs) have dramatically advanced NLP accuracy. These deep
learning models have multiple hidden layers that may correspond to billions of trainable parameters, allowing
them to apply information learned from training on large unlabeled corpora to a specific task of interest. However,
their size leaves them especially vulnerable to confounding bias, induced by variables that can influence both
the predictor (text) and the outcome (e.g. an associated diagnosis) of a predictive model. Such systematic biases
are a recognized danger in the application of artificial intelligence methods to clinical problems, and are the focus
of NLM NOT-LM-19-003, which invites applications proposing methods to identify and address them. Deep
learning models in general require large amounts of training data, spurring initiatives to aggregate medical data
from across institutional siloes. This can increase data set size and enhance model portability, but leaves the
resulting models vulnerable to confounding by provenance, where models learn to recognize the origin of dataset
components and make biased predictions based on site-specific class distributions (e.g. COVID prevalence).
Such models will assign classes based on indicators of dataset provenance, rather than diagnostically
meaningful linguistic differences, and make erroneous predictions when the provenance-specific distributions at
the point of deployment differ from those in the training set. Confounding of this nature is a pervasive problem
that presents a fundamental barrier to the portability of trained models, and threatens the utility of datasets
assembled from across institutions and services. Unlike in traditional statistical and machine learning models,
feature representations in deep transformer networks are distributed across parameters spread throughout the
entire network. New methods are needed to meet the challenge of identifying and mitigating the influence of
confounding variables in such models. In the proposed research we will develop a systematic approach to
Deconfounding Deep Transformer Networks (DeconDTN), embodied in an eponymous and publicly available
set of open-source tools for (1) identification of provenance-related biases, (2) mitigation of these biases using
a novel set of validated methods, and (3) systematic evaluation of the resulting effects on model performance.
While DeconDTN will be generally applicable, development and evaluation will occur in the context of three use
cases involving data sets drawn from different sources: classification of speech transcripts from participants with
dementia drawn from two locations, identification of goals-of-care discussions in clinical notes drawn from
multiple studies involving a rang...
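
The confounding-by-provenance failure mode described in the abstract can be illustrated with a small deterministic sketch. It is not part of the DeconDTN tools and does not use a transformer. The site names `A` and `B`, the class counts, the 80% reliable `marker` feature, and every function name below are hypothetical, chosen only to show how a rule keyed on data origin can look better than a diagnostically meaningful feature during training and then fail once site-specific prevalence shifts.

```python
from collections import Counter, defaultdict

# Each row is (site, marker, label). All values are invented for illustration.

def make_site(site, n_pos, n_neg, marker_pct=80):
    """Build rows from exact counts; the marker matches the label marker_pct% of the time."""
    rows = []
    for label, n in ((1, n_pos), (0, n_neg)):
        n_correct = n * marker_pct // 100
        rows += [(site, label, label)] * n_correct            # marker agrees with label
        rows += [(site, 1 - label, label)] * (n - n_correct)  # marker disagrees
    return rows

def site_of(row):
    return row[0]

def marker_of(row):
    return row[1]

def fit_majority(rows, key):
    """Learn the majority label for each value of a single feature."""
    counts = defaultdict(Counter)
    for row in rows:
        counts[key(row)][row[2]] += 1
    return {value: c.most_common(1)[0][0] for value, c in counts.items()}

def accuracy(rule, rows, key):
    """Fraction of rows whose label matches the rule's prediction."""
    return sum(rule[key(r)] == r[2] for r in rows) / len(rows)

# Training data: site A is 90% positive, site B is 10% positive.
train = make_site("A", 90, 10) + make_site("B", 10, 90)
# Deployment data: the site-specific prevalences are reversed.
deploy = make_site("A", 10, 90) + make_site("B", 90, 10)

by_site = fit_majority(train, site_of)      # predicts from provenance alone
by_marker = fit_majority(train, marker_of)  # predicts from the meaningful feature

for name, rule, key in (("site", by_site, site_of), ("marker", by_marker, marker_of)):
    print(name, accuracy(rule, train, key), accuracy(rule, deploy, key))
```

In this toy setting the provenance rule scores higher than the marker rule on the training data, so a learner that only minimizes training error has an incentive to rely on it. The marker rule's accuracy is the same on both sets, while the provenance rule's accuracy falls sharply on the deployment set.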

## Key facts

- **NIH application ID:** 10774319
- **Project number:** 5R01LM014056-03
- **Recipient organization:** UNIVERSITY OF WASHINGTON
- **Principal Investigator:** Trevor Cohen
- **Activity code:** R01
- **Funding institute:** NIH
- **Fiscal year:** 2024
- **Award amount:** $348,541
- **Award type:** 5 (non-competing continuation)
- **Project period:** 2022-06-01 → 2026-02-28

## Primary source

NIH RePORTER: https://reporter.nih.gov/project-details/10774319

## Citation

> US National Institutes of Health, RePORTER application 10774319, DeconDTN: Deconfounding Deep Transformer Networks for Clinical NLP (5R01LM014056-03). Retrieved via AI Analytics 2026-05-24 from https://api.ai-analytics.org/grant/nih/10774319. Licensed CC0.

---

*[NIH grants dataset](/datasets/nih-grants) · CC0 1.0*
