# Improving health data quality by assessing and enhancing semantic integrity

> **NIH AHRQ R01** · GEORGE WASHINGTON UNIVERSITY · 2023 · $392,674

## Abstract

As terminologies change and are used over time by different entities, what a single code or a
set of codes represents can change and diverge. These changes range from adding a new, more
specific meaning to a code set (e.g., adding another possibility to codes whose meaning was
first listed as an NEC (not elsewhere classified) code), to a major version change (as in the
transition from ICD-9-CM to ICD-10-CM), to local adaptation (e.g., an institution using a more
general code to indicate a more specific condition). Such
altering of the semantics of codes presents a challenge that can be termed representational
semantic integrity (RS integrity). If multiple codes or multiple combinations of codes can
represent the same phenotype, cohort identification or cohort variable assignment based on
the codes becomes problematic. As numerous research projects utilize large electronic health
record (EHR) datasets containing standardized terminology codes, violations of RS integrity
would be expected to propagate errors in subsequent analyses and findings. The proposed
project seeks to address the question: how can RS integrity be assessed and improved in
longitudinal and heterogeneous EHR data using automated methods? We propose to
develop novel data-driven methods to analyze the temporal patterns and the context of
EHR variables. Using ICD-9-CM, ICD-10-CM, CPT, and SNOMED codes as our use cases, this
study will leverage very large, longitudinal, heterogeneous datasets: the Clinical Data Warehouse
of the Veterans Administration (VA)’s national EHR system, Cerner Real World Data (RWD),
and the EHR data repository of a large medical center at the University of Alabama at
Birmingham (UAB). Our aims are: 1) Develop data-driven approaches to assess RS
integrity in longitudinal EHR data. We will develop statistical and deep learning models to
perform multivariate time-series analysis for the purpose of detecting aberrant signals in codes
in EHR records; 2) Develop data-driven approaches to improve RS integrity in longitudinal
EHR data. We will analyze the contexts of codes over time and across data sources using
embedding techniques and develop a semantic matching tool that generates semantically
equivalent clusters for data from different time periods and facilities; and 3) Validate the
assessment and improvement approaches on different coding sets and data sources. We
will also assess the impact on predictive modeling.
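
To make the two technical ideas in the aims concrete, here is a minimal sketch and not the project's method. It assumes monthly usage counts for one code and per-encounter sets of codes are already available. It swaps in two simple stand-ins: a trailing-window z-score in place of the proposed statistical and deep learning time-series models, and raw co-occurrence counts with cosine similarity in place of learned embeddings. All function names, thresholds, and the codes `A`, `X`, `Y`, `Z`, `W` are hypothetical.

```python
from collections import Counter
from math import sqrt
from statistics import mean, pstdev


def flag_usage_shifts(counts, window=6, threshold=3.0):
    """Return indices whose count deviates sharply from the trailing window.

    A simple z-score heuristic used only for illustration.
    """
    flagged = []
    for i in range(window, len(counts)):
        baseline = counts[i - window:i]
        mu = mean(baseline)
        sigma = max(pstdev(baseline), 1.0)  # floor avoids division by zero
        if abs(counts[i] - mu) / sigma > threshold:
            flagged.append(i)
    return flagged


def context_vector(encounters, code):
    """Count the codes that co-occur with `code` in the same encounter."""
    ctx = Counter()
    for codes in encounters:
        if code in codes:
            ctx.update(c for c in codes if c != code)
    return ctx


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


# Hypothetical monthly counts for one code: usage drops abruptly at index 7.
monthly_counts = [100, 102, 98, 101, 99, 100, 103, 40, 38, 41]
shift_months = flag_usage_shifts(monthly_counts)

# Hypothetical encounters before and after a suspected change in meaning.
before = [{"A", "X"}, {"A", "X", "Y"}]
after = [{"A", "Z"}, {"A", "Z", "W"}]
similarity = cosine(context_vector(before, "A"), context_vector(after, "A"))
```

In this toy data, the abrupt drop is flagged once, and the code `A` shares no co-occurring codes across the two periods, so its context similarity is zero. A low similarity of this kind is the sort of signal that could suggest a code's meaning has shifted, although real EHR data would need far more careful modeling.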

## Key facts

- **NIH application ID:** 10651693
- **Project number:** 5R01HS028450-02
- **Recipient organization:** GEORGE WASHINGTON UNIVERSITY
- **Principal Investigator:** STUART James NELSON
- **Activity code:** R01
- **Funding institute:** AHRQ
- **Fiscal year:** 2023
- **Award amount:** $392,674
- **Award type:** 5
- **Project period:** 2022-07-01 → 2026-04-30

## Primary source

NIH RePORTER: https://reporter.nih.gov/project-details/10651693

## Citation

> US National Institutes of Health, RePORTER application 10651693, Improving health data quality by assessing and enhancing semantic integrity (5R01HS028450-02). Retrieved via AI Analytics 2026-05-23 from https://api.ai-analytics.org/grant/nih/10651693. Licensed CC0.

---

*[NIH grants dataset](/datasets/nih-grants) · CC0 1.0*
