# Discovering and Applying Knowledge in Clinical Databases

> **NIH R01** · COLUMBIA UNIVERSITY HEALTH SCIENCES · 2020 · $582,678

## Abstract

The long-term goal of our ongoing project, “Discovering and applying knowledge in clinical databases,” is to
learn from data in the electronic health record (EHR) and to apply that knowledge to understand and improve
health. The EHR, because of its broad capture of human health, greatly amplifies our ability to carry out
observational research, opening the possibility of covering emerging problems, diverse populations, rare
diseases, and chronic diseases in long-term longitudinal studies. Unfortunately, the strength of EHR data—its
breadth and flexible nature—imposes additional challenges. We have found that the biggest challenge comes
from the inaccuracy, incompleteness, complexity, and resulting bias inherent in the recording of the health care
process. We previously showed that health care process bias can be severe enough that, for example, simple use of
the data can create signals implying the opposite of what we know to be true. One of the most important factors
is sparse, irregular sampling; we found that sampling bias can be reduced by reparameterizing time and that
prediction techniques such as data assimilation, which can accommodate EHR-specific data and resist their biases,
can be used on EHR data to produce good estimates of glucose and HbA1c. The previous cycle of this project
produced 75 publications.
We propose to develop methods to accommodate health care process bias, using both knowledge engineering
and experience with health care process bias as well as advanced statistical techniques that employ dynamical
models and latent variables. We hypothesize that heuristics and models combined with knowledge can improve
our ability to generate inferences and learn phenotypes despite health care process bias. Our aims are as
follows: (1) Taking a knowledge engineering approach, study the effect of preprocessing and analytic choices on
reducing health care process bias, and using machine learning techniques, learn more about health care
process bias. (2) Taking a more empirical approach, use dynamic latent factor modeling and variational inference
to accommodate health care process bias, learning how a patient's health state and health processes affect
censoring, exploiting information from many variables at once. (3) Use data assimilation and mechanistic
models to learn otherwise unmeasurable physiologic phenotypes despite irregular, sparse sampling typical of
electronic health records. (4) Use the developed models and generated phenotypes to answer clinical questions,
and disseminate the results.

## Key facts

- **NIH application ID:** 9873996
- **Project number:** 5R01LM006910-20
- **Recipient organization:** COLUMBIA UNIVERSITY HEALTH SCIENCES
- **Principal Investigator:** GEORGE M HRIPCSAK
- **Activity code:** R01
- **Funding institute:** NIH
- **Fiscal year:** 2020
- **Award amount:** $582,678
- **Award type:** 5
- **Project period:** 2000-04-01 → 2024-02-28

## Primary source

NIH RePORTER: https://reporter.nih.gov/project-details/9873996

## Citation

> US National Institutes of Health, RePORTER application 9873996, Discovering and Applying Knowledge in Clinical Databases (5R01LM006910-20). Retrieved via AI Analytics 2026-05-22 from https://api.ai-analytics.org/grant/nih/9873996. Licensed CC0.

---

*[NIH grants dataset](/datasets/nih-grants) · CC0 1.0*
