# Robust Inference in the Presence of Data Heterogeneity and Structured Missing Data

> **NIH R01** · UNIVERSITY OF SOUTHERN CALIFORNIA · 2020 · $236,796

## Abstract

Modern sequencing platforms can sequence tens of billions of bases per run and generate petabytes of
data, but individual study sizes may be small. Similarly, a wide variety of health data are now publicly
available to inform health policy decisions, and it may be advantageous to use data from several different
surveys. The ability to aggregate and compare heterogeneous data across different datasets would be
critical to expanding the usable data available for any individual study. We propose systematically studying
two major barriers to this effort: 1) Aggregating different medical and biological datasets; 2) Dealing
with batch effects and structured heterogeneous data. Aim 1 allows us to fully utilize information on
related topics from diverse datasets, as information across different experiments needs to be combined in a
statistically rigorous, reliable way: the process needs to fully exploit the available information, not
introduce biases, and still be systematic and reproducible. Not all experiments study the same set of
variables/features, and combining this information is a non-trivial task. The second aim allows researchers
to handle heterogeneity between individuals or samples, which is ubiquitous in biological and
health data. For instance, sequencing machines evolve over time, and samples obtained with new
technologies cannot be directly compared to samples taken on older systems, even if data was collected in
the same lab. This also applies to samples obtained under different environmental conditions. Currently,
researchers are forced to either ignore such biases, potentially leading to violations of statistical validity, or
limit their analysis to data generated in one batch of samples. This work will extend the set of useful data
available to researchers in a wide variety of domains and provide methods to compare and synthesize
disparate datasets. The proposed work will result in: (1) Development of algorithms with theoretical
performance guarantees for combining information from datasets with a small number of overlapping
features; (2) Development of rigorous statistical procedures for hypothesis testing in the presence of
within-group heterogeneity. These methods are particularly helpful for pre-/post-treatment studies, studies
containing batch effects, or studies where samples are collected over long time periods using different
technologies; (3) Implementation of these methods in case studies in molecular biology (genetic
pathway hypothesis generation) and population survey data for health policy modeling.

## Key facts

- **NIH application ID:** 10000139
- **Project number:** 5R01LM013315-02
- **Recipient organization:** UNIVERSITY OF SOUTHERN CALIFORNIA
- **Principal Investigator:** Meisam Razaviyayn
- **Activity code:** R01
- **Funding institute:** NIH
- **Fiscal year:** 2020
- **Award amount:** $236,796
- **Award type:** 5
- **Project period:** 2019-09-01 → 2022-08-31

## Primary source

NIH RePORTER: https://reporter.nih.gov/project-details/10000139

## Citation

> US National Institutes of Health, RePORTER application 10000139, Robust Inference in the Presence of Data Heterogeneity and Structured Missing Data (5R01LM013315-02). Retrieved via AI Analytics 2026-05-23 from https://api.ai-analytics.org/grant/nih/10000139. Licensed CC0.

---

*[NIH grants dataset](/datasets/nih-grants) · CC0 1.0*