# Privacy-preserving methods and tools for handling missing data in distributed health data networks

> **NIH R01** · UNIVERSITY OF PENNSYLVANIA · 2020 · $567,979

## Abstract

Distributed health data networks (DHDNs) that leverage electronic health records (EHRs) (e.g., eMERGE,
pSCANNER, PEDSnet) have drawn substantial interest in recent years, as they a) eliminate the need to
create, maintain, and secure access to central data repositories, b) minimize the need to disclose protected
health information outside the data-owning entity, and c) mitigate many security, proprietary, legal, and privacy
concerns. Missing data are ubiquitous and present analytical challenges in DHDNs. However, very limited
research has been conducted to address missing data in such settings. When applied to a distributed
environment, the current state-of-the-art approaches for handling missing data require pooling raw data into a
central repository before analysis and hence require individual-level data sharing, which may not be feasible
for a number of reasons, including institutional policies prohibiting such sharing, high regulatory hurdles, public
privacy concerns, and costs/overhead of moving massive amounts of data. A large body of research has
demonstrated that, given some background information about an individual, such as data from EHRs, an
adversary can learn (from “de-identified” data) sensitive information about the individual, and improper
disclosure of individual-level data may have serious implications. The proposed research will address the
challenges associated with handling missing data in distributed analysis and fill a crucial methodology gap. We
propose the following specific aims: 1) develop privacy-preserving distributed methods for handling missing
data in horizontally partitioned data; 2) develop privacy-preserving distributed methods for handling missing
data in vertically partitioned data; 3) develop a user-friendly toolkit to allow researchers to handle missing data
for distributed analysis in health data networks; and 4) evaluate and validate the methods and toolkit using the
UCSD obesity patient data prepared for pSCANNER and data from PEDSnet, in addition to simulated data.
The proposed approaches will enable the use of data across multiple sites and will not require pooling patient-level
data into a central repository. They can be scaled up to handle massive amounts of data in DHDNs, because
the decomposed computation can be parallelized across all participating parties. The results of our study will
significantly advance the state-of-the-art in missing data methodology for DHDNs. The privacy-preserving
software toolkit will enable researchers to use more complete data in their research by leveraging information
from multiple sites without compromising patient privacy, and help lower regulatory and other hurdles for
collaboration across multiple institutions and build public trust. As such, it will encourage more institutions
and healthcare systems to become part of a clinical data research network and more patients to participate in
clinical studies, which will improve the validity, robustness and gen...

## Key facts

- **NIH application ID:** 9939577
- **Project number:** 5R01GM124111-04
- **Recipient organization:** UNIVERSITY OF PENNSYLVANIA
- **Principal Investigator:** Qi Long
- **Activity code:** R01
- **Funding institute:** NIH
- **Fiscal year:** 2020
- **Award amount:** $567,979
- **Award type:** 5
- **Project period:** 2017-09-08 → 2023-06-30

## Primary source

NIH RePORTER: https://reporter.nih.gov/project-details/9939577

## Citation

> US National Institutes of Health, RePORTER application 9939577, Privacy-preserving methods and tools for handling missing data in distributed health data networks (5R01GM124111-04). Retrieved via AI Analytics 2026-05-23 from https://api.ai-analytics.org/grant/nih/9939577. Licensed CC0.

---

*[NIH grants dataset](/datasets/nih-grants) · CC0 1.0*