Improving health data quality by assessing and enhancing semantic integrity

NIH RePORTER · AHRQ · R01 · $399,999

Abstract

As terminologies change and are used over time by different entities, the meaning represented by a single code or a set of codes can shift and diverge. These changes range from adding a new, more specific meaning to a code set (e.g., adding another possibility to codes whose meaning was first listed under an NEC (not elsewhere classified) code), to a major version change (as in the transition from ICD-9CM to ICD-10CM), to local adaptation (e.g., an institution using a more general code to indicate a more specific condition). Such alteration of the semantics of codes presents a challenge to what can be termed representational semantic integrity (RS integrity). If multiple codes or multiple combinations of codes can represent the same phenotype, cohort identification or cohort variable assignment based on the codes becomes problematic. Because numerous research projects use large electronic health record (EHR) datasets containing standardized terminology codes, violations of RS integrity would be expected to propagate errors into subsequent analyses and findings. The proposed project seeks to address the question: How can RS integrity be assessed and improved in longitudinal and heterogeneous EHR data using automated methods? We propose to develop novel data-driven methods to analyze the temporal patterns and the context of EHR variables. Using ICD-9CM, ICD-10CM, CPT and SNOMED codes as our use cases, this study will leverage very large, longitudinal, heterogeneous datasets: the Clinical Data Warehouse of the Veteran Administration (VA)’s national EHR system, the Cerner Real World Data (RWD), and the EHR data repository of a large medical center at the University of Alabama at Birmingham (UAB). Our aims are: 1) Develop data-driven approaches to assess RS integrity in longitudinal EHR data. We will develop statistical and deep learning models to perform multivariate time-series analysis for the purpose of detecting aberrant signals in codes in EHR records; 2) Develop data-driven approaches to improve RS integrity in longitudinal EHR data. We will analyze the contexts of codes over time and across data sources using embedding techniques and develop a semantic matching tool that generates semantically equivalent clusters for data from different time periods and facilities; and 3) Validate the assessment and improvement approaches on different coding sets and data sources. We will also assess the impact on predictive modeling.
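As a rough illustration of what "detecting aberrant signals in codes" in Aim 1 could look like in its simplest form, the sketch below flags months in which the usage count of a single code departs sharply from its recent history. It is not the project's method: the abstract proposes statistical and deep learning models over multivariate time series, whereas this is a univariate trailing z-score check. The function name `flag_aberrant_months`, the `window` and `threshold` parameters, and the monthly counts are all hypothetical.

```python
from statistics import mean, stdev


def flag_aberrant_months(monthly_counts, window=6, threshold=3.0):
    """Return indices of months whose count deviates sharply from the
    preceding `window` months (trailing z-score check).

    Hypothetical helper for illustration only; `window` and `threshold`
    are assumed values, not taken from the project.
    """
    flagged = []
    for i in range(window, len(monthly_counts)):
        history = monthly_counts[i - window:i]
        mu = mean(history)
        sigma = stdev(history)
        if sigma == 0:
            # Flat history: the z-score is undefined, so flag any change.
            if monthly_counts[i] != mu:
                flagged.append(i)
            continue
        z = (monthly_counts[i] - mu) / sigma
        if abs(z) >= threshold:
            flagged.append(i)
    return flagged


# Made-up monthly usage of one ICD-9CM code around a switch to ICD-10CM.
counts = [100, 104, 98, 101, 97, 103, 99, 102, 5, 0, 0, 0]
print(flag_aberrant_months(counts))  # prints [8]
```

Only the first month of the drop is flagged here, because the collapsed counts quickly inflate the trailing standard deviation. That limitation is one reason a real assessment would likely need more robust change-point or multivariate models.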

Key facts

NIH application ID: 10446586
Project number: 1R01HS028450-01A1
Recipient: GEORGE WASHINGTON UNIVERSITY
Principal Investigator: STUART James NELSON
Activity code: R01
Funding institute: AHRQ
Fiscal year: 2022
Award amount: $399,999
Award type: 1
Project period: 2022-07-01 → 2026-04-30