# Bridging the Semantic Gap Between Research Eligibility Criteria and Clinical Data

> **NIH R01** · COLUMBIA UNIVERSITY HEALTH SCIENCES · 2020 · $618,617

## Abstract

Our long-term goal is to optimize the design and conduct of human clinical research using informatics [1].
Eligibility criteria define the study population for every human study. Their clarity, accuracy and precision are
crucial to the success of participant recruitment, results dissemination, and evidence synthesis. Our goal for this
renewal is to build a data-driven and knowledge-based decision aid for real-life clinical researchers to optimize
research eligibility criteria definition.

The difference in the semantic representation of an eligibility criterion (e.g., having Type 2 diabetes mellitus)
and its operationalization as a clinical variable (e.g., HbA1C ≥ 6.5% or ICD-9 code = ‘250.00’) has been defined
as the semantic gap [2], the closing of which is a grand challenge for biomedical informatics [2,3]. Our research has
contributed to the in-depth understanding of this semantic gap and how it limits computational reuse and effective
communication of eligibility criteria to key stakeholders of clinical research [4-9]. We have developed informatics
methods to help bridge this gap by transforming free-text eligibility criteria into semi-structured formats to aid in
study cohort identification [10-13], analysis of the population representativeness of related clinical trials [14-19], text
mining of common eligibility features and their trends [18,20-24], and identification of questionable exclusion criteria
for mental disorder trials [25]. We used several of these methods to develop a visualization system called VITTA [17]
that shows how eligibility criteria and the clinical features of clinical trial populations vary across related trials.
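
The gap between a criterion as written and its operationalization can be illustrated with a minimal sketch. It uses only the abstract's own example (HbA1C ≥ 6.5% or ICD-9 code 250.00); the `PatientRecord` structure, the field names, and the function `meets_t2dm_criterion` are hypothetical and are not taken from the project's software.

```python
from dataclasses import dataclass, field
from typing import List, Set

# Values taken from the abstract's example; everything else is illustrative.
HBA1C_THRESHOLD_PCT = 6.5
T2DM_ICD9_CODES = {"250.00"}


@dataclass
class PatientRecord:
    """Hypothetical, simplified view of one patient's EHR data."""
    patient_id: str
    hba1c_values_pct: List[float] = field(default_factory=list)
    icd9_codes: Set[str] = field(default_factory=set)


def meets_t2dm_criterion(record: PatientRecord) -> bool:
    """Operationalize 'having Type 2 diabetes mellitus' as a data check.

    A patient matches if any recorded HbA1c value reaches the threshold
    or if a listed ICD-9 code is present.
    """
    has_high_hba1c = any(
        value >= HBA1C_THRESHOLD_PCT for value in record.hba1c_values_pct
    )
    has_diagnosis_code = bool(record.icd9_codes & T2DM_ICD9_CODES)
    return has_high_hba1c or has_diagnosis_code


patients = [
    PatientRecord("p1", hba1c_values_pct=[5.4, 6.8]),
    PatientRecord("p2", icd9_codes={"250.00"}),
    PatientRecord("p3", hba1c_values_pct=[5.9], icd9_codes={"401.9"}),
]
eligible_ids = [p.patient_id for p in patients if meets_t2dm_criterion(p)]
```

Even this small rule forces choices that the free-text criterion leaves open, such as which lab value counts and which codes are included, which is the kind of ambiguity the abstract describes.
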

More importantly, our research has revealed an understudied root cause of the semantic gap, which is that
eligibility criteria are often poorly defined, inaccurate, nonspecific, or imprecise, and not easily translatable to the
real-world electronic health record (EHR) data representations to which the criteria must be operationalized. The
advent of Big Patient Data offers an unprecedented opportunity to draw on the characteristics of real-world
patients to guide and inform the data-driven precise definition of eligibility criteria [25]. By defining the characteristics
of the intended study population, eligibility criteria critically influence the population representativeness of a
clinical study, which further influences the tradeoff between patient safety and research results’ replicability and
generalizability. We hypothesize that by integrating patient data, including clinical and genomic data, with public
clinical trial information, we can proactively guide investigators to optimize the precision, recruitment feasibility
and representativeness of eligibility criteria. This research will demonstrate a novel data-driven and
knowledge-based system to assist researchers with optimizing eligibility criteria, through innovative informatics
methods for integrating proprietary and public data for deep phenotyping, target p...

## Key facts

- **NIH application ID:** 9983140
- **Project number:** 5R01LM009886-11
- **Recipient organization:** COLUMBIA UNIVERSITY HEALTH SCIENCES
- **Principal Investigator:** CHUNHUA WENG
- **Activity code:** R01
- **Funding institute:** NIH
- **Fiscal year:** 2020
- **Award amount:** $618,617
- **Award type:** 5 (non-competing continuation)
- **Project period:** 2017-09-14 → 2023-08-31

## Primary source

NIH RePORTER: https://reporter.nih.gov/project-details/9983140

## Citation

> US National Institutes of Health, RePORTER application 9983140, Bridging the Semantic Gap Between Research Eligibility Criteria and Clinical Data (5R01LM009886-11). Retrieved via AI Analytics 2026-05-23 from https://api.ai-analytics.org/grant/nih/9983140. Licensed CC0.

---

*[NIH grants dataset](/datasets/nih-grants) · CC0 1.0*
