# Consistency of Uses of ICD Codes for Retrospective Data Analysis

> **NIH VA I21** · U.S. DEPT/VETS AFFAIRS MEDICAL CENTER · 2021 · —

## Abstract

Background: Codes from the Clinical Modification (CM) of the 9th Revision of the International Classification of Diseases (ICD) have been the standard for clinical, operational, and research activities using health record data in the U.S. for decades. On October 1, 2015, the Centers for Medicare & Medicaid Services (CMS) replaced the ICD-9CM codes with ICD-10CM codes, which differ fundamentally from ICD-9CM in structure and concepts. In many cases, there is no exact match between the two code sets. To smooth the transition, the Centers for Disease Control and Prevention (CDC) and the CMS created General Equivalence Mappings (GEM), or “crosswalks,” that can translate one code set to the other. However, the GEM does not translate one code to another simply, automatically, and with complete reliability.

Significance/Impact: Health services research relies on accurate and reliable use of ICD codes. Retrospective analyses using existing EHR data assume that ICD codes are a relatively consistent representation of the clinical data. The lack of automated and reliable translation between ICD-9CM and ICD-10CM has been shown to result in incorrect estimates of disease prevalence, which may lead to serious errors in cohort identification, statistical analyses, or machine learning models.

Innovation: Existing crosswalk tools such as the GEM were developed based solely on the terms and hierarchy of ICD-9CM and ICD-10CM. We propose to study the actual longitudinal and contextual usage of ICD-9CM and ICD-10CM in EHR data. The advantage of a large EHR repository such as the VA clinical data warehouse (CDW) is that it offers a long time series (~20 years in CDW) and extremely rich clinical context (e.g., demographics, labs, medications, and text notes) in which to examine the consistency of ICD usage.

Specific Aims: 1) To assess the consistency of ICD-9CM and ICD-10CM usage in VA EHR data by detecting aberrant signals using time-series analysis methods; and 2) To improve the consistency of ICD-9CM and ICD-10CM usage in VA EHR data, using embedding methods to compare usage contexts.

Methodology: The Aim 1 analysis will use signal detection methods that have been validated in bio-surveillance. Aim 2 will use embedding methods to map each ICD-9CM and ICD-10CM code to a latent semantic space based on its usage context. Terminology and domain experts will review a stratified sample of the results.

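
The abstract does not name a specific detector for Aim 1, so the following is only a minimal sketch of one common style of aberration detection: comparing each period's count against a trailing baseline and flagging large deviations. The function name `flag_aberrations`, the 6-month baseline, the threshold of 3 standard deviations, and the monthly counts are all illustrative assumptions, not details from the project.

```python
from statistics import mean, stdev


def flag_aberrations(counts, baseline=6, threshold=3.0):
    """Return indices whose count deviates from the trailing baseline
    by more than `threshold` standard deviations.

    `baseline` and `threshold` are illustrative defaults, not values
    taken from the project.
    """
    flagged = []
    for i in range(baseline, len(counts)):
        window = counts[i - baseline:i]
        mu = mean(window)
        # Floor the spread at 1.0 so a perfectly flat baseline does not
        # cause a division by zero.
        sigma = max(stdev(window), 1.0)
        z = (counts[i] - mu) / sigma
        if abs(z) > threshold:
            flagged.append(i)
    return flagged


# Hypothetical monthly counts for one diagnosis concept. The sharp drop
# at index 8 mimics a coding change such as a code-set transition.
monthly_counts = [100, 102, 98, 101, 99, 100, 103, 97, 40, 42, 41, 39]
print(flag_aberrations(monthly_counts))  # [8]
```

Only the first month of the drop is flagged, because the low values then inflate the trailing baseline's spread. A detector meant to track a sustained level shift would need a different baseline strategy.
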
Implementation/Next Steps: Findings of this pilot project will be shared with our operational partners in the VA central office. We envision further investigations that build on this pilot to develop a user-friendly ICD translation tool and more accurate ICD mappings for VA and other EHR datasets, over time and across facilities, and to extend the effort beyond ICD to other terminologies.
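
To illustrate how usage context could inform the more accurate mappings envisioned above, here is a minimal sketch that ranks candidate ICD-10CM codes by the similarity of their usage context to that of an ICD-9CM code. It uses raw co-occurrence counts and cosine similarity as a simple stand-in for learned embeddings. The context features and all counts are invented for illustration, and the helper names `cosine` and `rank_candidates` are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    keys = set(u) | set(v)
    dot = sum(u.get(k, 0) * v.get(k, 0) for k in keys)
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def rank_candidates(source_vec, candidates):
    """Sort candidate codes by similarity to the source context, best first."""
    scored = [(code, cosine(source_vec, vec)) for code, vec in candidates.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Invented co-occurrence counts (how often a code appears alongside each
# context feature). These are toy numbers, not data from the VA CDW.
icd9_context = {"metformin": 40, "hba1c_lab": 35, "lisinopril": 5}
icd10_contexts = {
    "E11.9": {"metformin": 38, "hba1c_lab": 30, "lisinopril": 6},
    "I10": {"lisinopril": 45, "metformin": 4, "hba1c_lab": 2},
}

ranking = rank_candidates(icd9_context, icd10_contexts)
print(ranking[0][0])  # E11.9
```

In practice the vectors would come from an embedding model trained on the clinical context rather than from hand-built counts, and top-ranked candidates would still need expert review, as the abstract describes.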

## Key facts

- **NIH application ID:** 10187290
- **Project number:** 1I21HX003278-01A1
- **Recipient organization:** U.S. DEPT/VETS AFFAIRS MEDICAL CENTER
- **Principal Investigator:** QING ZENG
- **Activity code:** I21
- **Funding institute:** VA
- **Fiscal year:** 2021
- **Award amount:** —
- **Award type:** 1
- **Project period:** 2021-02-01 → 2022-07-31

## Primary source

NIH RePORTER: https://reporter.nih.gov/project-details/10187290

## Citation

> US National Institutes of Health, RePORTER application 10187290, Consistency of Uses of ICD Codes for Retrospective Data Analysis (1I21HX003278-01A1). Retrieved via AI Analytics 2026-05-23 from https://api.ai-analytics.org/grant/nih/10187290. Licensed CC0.

---

*[NIH grants dataset](/datasets/nih-grants) · CC0 1.0*
