# Open Health Natural Language Processing Collaboratory

> **NIH U01** · MAYO CLINIC ROCHESTER · 2021 · $1,487,619

## Abstract

Project Summary
One of the major barriers in leveraging Electronic Health Record (EHR) data for clinical and translational
science is the prevalent use of unstructured or semi-structured clinical narratives for documenting clinical
information. Natural Language Processing (NLP), which extracts structured information from narratives, has
received great attention and has played a critical role in enabling secondary use of EHRs for clinical and
translational research. As demonstrated by large scale efforts such as ACT (Accrual of patients for Clinical
Trials), eMERGE, and PCORnet, using EHR data for research rests on the capabilities of a robust data and
informatics infrastructure that allows the structuring of clinical narratives and supports the extraction of clinical
information for downstream applications. Current successful NLP use cases often require a strong informatics
team (with NLP experts) to work with clinicians to supply their domain knowledge and build customized NLP
engines iteratively. This requires close collaboration between NLP experts and clinicians, which is not feasible at
institutions with limited informatics support. Additionally, the usability, portability, and generalizability of
NLP systems remain limited, partly because of the lack of access to EHRs across institutions for training the
systems. The limited availability of EHR data also restricts the training opportunities needed to improve workforce competence
in clinical NLP. We aim to address the above challenges by extending our existing collaboration among
multiple CTSA hubs on open health natural language processing (OHNLP) to share distributional information of
NLP artifacts (i.e., words, n-grams, phrases, sentences, concept mentions, concepts, and text segments)
acquired from real EHRs across multiple institutions. We will leverage the advanced privacy-preserving
computing infrastructure of iDASH (integrating Data for Analysis, Anonymization, and SHaring) for
privacy-preserving data analysis models and will partner with diverse communities including Observational Health Data
Sciences and Informatics (OHDSI), Precision Medicine Initiative (PMI), PCORnet, and Rare Diseases Clinical
Research Network (RDCRN) to demonstrate the utility of NLP for translational research. This CTSA innovation
award RFA provides us with a unique opportunity to address the challenges facing clinical NLP.
Through strong partnerships with multiple research communities and the research team's leadership roles in
clinical NLP, we envision that the successful delivery of this project will broaden the use of clinical NLP
across the research community. Four aims are planned: i) obtain PHI-suppressed NLP artifacts with
retained distribution information across multiple institutions and assess the privacy risk of accessing
PHI-suppressed artifacts, ii) generate a synthetic text corpus for exploratory analysis of clinical narratives and
assess its utility in NLP tasks leveraging various NLP challenge...

## Key facts

- **NIH application ID:** 10244996
- **Project number:** 5U01TR002062-05
- **Recipient organization:** MAYO CLINIC ROCHESTER
- **Principal Investigator:** Xiaoqian Jiang
- **Activity code:** U01
- **Funding institute:** NIH
- **Fiscal year:** 2021
- **Award amount:** $1,487,619
- **Award type:** 5 (non-competing continuation)
- **Project period:** 2017-09-01 → 2023-06-30

## Primary source

NIH RePORTER: https://reporter.nih.gov/project-details/10244996

## Citation

> US National Institutes of Health, RePORTER application 10244996, Open Health Natural Language Processing Collaboratory (5U01TR002062-05). Retrieved via AI Analytics 2026-05-25 from https://api.ai-analytics.org/grant/nih/10244996. Licensed CC0.

---

*[NIH grants dataset](/datasets/nih-grants) · CC0 1.0*
