# Modeling the Incompleteness and Biases of Health Data

> **NIH R01** · NORTHWESTERN UNIVERSITY · 2022 · $311,339

## Abstract

Researchers are increasingly working to “mine” health data to derive new medical knowledge. Unlike
experimental data, which are collected under a research protocol, clinical data are recorded primarily to help
clinicians care for patients, so the procedures for their collection are often not systematic. As a result, missing and/or
biased data can hinder medical knowledge discovery and data mining efforts. Existing approaches to missing health
data imputation often focus only on cross-sectional correlation (e.g., correlation across subjects or across
variables) but neglect autocorrelation (e.g., correlation across time points). Moreover, they often focus on
modeling incompleteness but neglect the biases in health data.
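
The distinction between cross-sectional correlation and autocorrelation can be illustrated with a toy sketch. This is not the proposed framework: the function names, the `weight` parameter, and the `labs` array are hypothetical, and simple column means and linear interpolation stand in for whatever models the project actually uses.

```python
import numpy as np

def impute_cross_sectional(x):
    """Fill NaNs with the mean of the same time point across subjects."""
    # x has shape (subjects, time); nanmean ignores missing entries
    col_mean = np.nanmean(x, axis=0)
    return np.where(np.isnan(x), col_mean, x)

def impute_temporal(x):
    """Fill NaNs by linear interpolation along each subject's own time axis."""
    out = x.copy()
    t = np.arange(x.shape[1])
    for row in out:  # each row is a view, so assignments update `out`
        seen = ~np.isnan(row)
        if seen.any():
            # np.interp holds the edge value constant outside the observed range
            row[~seen] = np.interp(t[~seen], t[seen], row[seen])
    return out

def impute_combined(x, weight=0.5):
    """Blend both estimates; `weight` is an illustrative assumption."""
    return weight * impute_cross_sectional(x) + (1 - weight) * impute_temporal(x)

# Hypothetical lab values: 3 subjects measured at 3 time points
labs = np.array([
    [1.0, np.nan, 3.0],
    [2.0, 4.0, np.nan],
    [3.0, 6.0, 9.0],
])
combined = impute_combined(labs)
```

Observed entries pass through unchanged because both estimates agree with the data wherever it exists; only the missing cells differ between the two views.
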

Modeling both incompleteness and bias may contribute to a better understanding of health data and better
support for clinical decision making. We propose a novel framework, Bias-Aware Missing data Imputation with
Cross-sectional correlation and Autocorrelation (BAMICA), and leverage clinical notes to better inform
methods that would otherwise rely on structured health data only. In addition to evaluating its imputation
accuracy, we will apply the proposed framework to downstream tasks such as predictive modeling for
multiple outcomes across a diverse range of clinical and cohort study datasets.

Aim 1 introduces the MICA framework to jointly consider cross-sectional correlation and autocorrelation. In
Aim 2, we will augment MICA to be bias-aware (hence BAMICA), accounting for biases that stem from multiple
sources, such as the healthcare process, and using them as features in imputing missing health data. This augmentation
is achieved by a novel recurrent neural network architecture that keeps track of the evolution of both health data
variables and bias factors. In Aim 3, we will supplement structured health data with unstructured clinical notes for
modeling incompleteness and biases, using a novel architecture that places a graph neural network on top of a memory
network. We will apply graph neural networks to process clinical notes in order to learn representations
that serve as input to the memory networks for imputation and downstream predictive modeling tasks. Depending on the
clinical problem and data availability, not all modules may be needed. Thus, our proposed BAMICA framework
is designed to be flexible and consists of selectable modules to meet some or all of the above needs.
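
The abstract does not specify which bias factors BAMICA tracks. As a hedged illustration of how healthcare-process signals could be turned into features for a recurrent model, the sketch below derives two quantities commonly used for irregular clinical time series: a missingness mask and the time elapsed since each variable was last observed. The function name `process_features` and the example values are hypothetical.

```python
import numpy as np

def process_features(values, times):
    """Derive measurement-pattern features from one patient's record.

    values: array of shape (time, variables), NaN where not measured
    times:  array of shape (time,), e.g. hours since admission
    """
    mask = (~np.isnan(values)).astype(float)  # 1.0 where a value was recorded
    delta = np.zeros_like(values, dtype=float)
    for t in range(1, len(times)):
        gap = times[t] - times[t - 1]
        # The clock restarts if the variable was observed at the previous step,
        # and keeps accumulating otherwise.
        delta[t] = gap + (1.0 - mask[t - 1]) * delta[t - 1]
    return mask, delta

# Hypothetical record: 2 variables at 3 visits (hours 0, 4, 10)
values = np.array([
    [7.0, np.nan],
    [np.nan, np.nan],
    [6.5, 1.2],
])
times = np.array([0.0, 4.0, 10.0])
mask, delta = process_features(values, times)

# One possible per-step input: zero-filled values, mask, and time gaps
step_inputs = np.concatenate([np.nan_to_num(values), mask, delta], axis=1)
```

How often and when a test is ordered can itself carry information about the patient's condition, which is one reason such features might be supplied alongside the values rather than discarded.
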

In summary, our proposal bridges a key knowledge gap in jointly modeling incompleteness and biases in
health data and utilizes unstructured clinical notes to supplement and augment such modeling in order to better
support predictive modeling and clinical decision making. We will demonstrate generalizability by
experimenting on four large clinical and cohort study datasets, and by scaling up to the eMERGE network
spanning 11 institutions nationwide. We will disseminate the open-source framework. The principled and
flexible framework gen...

## Key facts

- **NIH application ID:** 10381541
- **Project number:** 5R01LM013337-03
- **Recipient organization:** NORTHWESTERN UNIVERSITY
- **Principal Investigator:** Yuan Luo
- **Activity code:** R01
- **Funding institute:** NIH
- **Fiscal year:** 2022
- **Award amount:** $311,339
- **Award type:** 5
- **Project period:** 2020-06-01 → 2024-03-31

## Primary source

NIH RePORTER: https://reporter.nih.gov/project-details/10381541

## Citation

> US National Institutes of Health, RePORTER application 10381541, Modeling the Incompleteness and Biases of Health Data (5R01LM013337-03). Retrieved via AI Analytics 2026-05-24 from https://api.ai-analytics.org/grant/nih/10381541. Licensed CC0.

---

*[NIH grants dataset](/datasets/nih-grants) · CC0 1.0*
