# Statistical Methods for Data Integration and Applications to Genome-wide Association Studies

> **NIH R01** · JOHNS HOPKINS UNIVERSITY · 2024 · $421,361

## Abstract

Large-scale epidemiologic studies, including biobanks and genome-wide association
studies (GWAS), are now rapidly leading to the identification of novel risk factors for
complex diseases. There is increasing opportunity to develop comprehensive models for
disease risk incorporating genetic markers, other biomarkers, life-style factors and
sociodemographic indicators. There are, however, major challenges, as information on all
of the potential risk factors is often not available in a single adequately large study.
Instead, information may be available from different studies, each of which may include
some subsets of the desired variables. Further, because of privacy concerns with
individual-level data, only summary-level information, i.e., estimates of model
parameters, may be available from some studies. We propose to develop a series of
novel statistical methods that will allow data integration across disparate datasets to
tackle modern problems faced in genetic and epidemiologic studies. In Aim 1, we will
develop a framework for building generalized linear models using detailed covariate data
from a main study, while incorporating summary-statistics information from an external
study. We will develop a series of applications of this framework to GWAS where we will
use covariate data from biobanks and perform combined analysis with external summary-
statistics data for powerful exploration of gene-environment interactions and mediations.
In Aim 2, we will extend the proposed framework of Aim 1 for developing models with
high-dimensional covariates with regularized parameter estimates. We will develop novel
computational algorithms for practical implementation of the method for large-scale data
analysis and develop new theory for inference on model parameters. We will further
develop application of the proposed method for fine-mapping and polygenic risk score
analysis conditional on covariates. In Aim 3, we will develop applications of the data
integration framework to account for different accuracy/depth of disease outcome data
across different studies. We will illustrate applications of different methods across the
aims using datasets on cancers (breast, melanoma and lung), cardiometabolic traits
(type-2 diabetes and coronary artery disease) and a psychiatric disorder (major
depressive disorder). We will develop and freely distribute user-friendly
software.

## Key facts

- **NIH application ID:** 10980746
- **Project number:** 1R01HG013137-01A1
- **Recipient organization:** JOHNS HOPKINS UNIVERSITY
- **Principal Investigator:** Nilanjan Chatterjee
- **Activity code:** R01
- **Funding institute:** NIH
- **Fiscal year:** 2024
- **Award amount:** $421,361
- **Award type:** 1
- **Project period:** 2024-09-23 → 2029-01-31

## Primary source

NIH RePORTER: https://reporter.nih.gov/project-details/10980746

## Citation

> US National Institutes of Health, RePORTER application 10980746, Statistical Methods for Data Integration and Applications to Genome-wide Association Studies (1R01HG013137-01A1). Retrieved via AI Analytics 2026-05-26 from https://api.ai-analytics.org/grant/nih/10980746. Licensed CC0.

---

*[NIH grants dataset](/datasets/nih-grants) · CC0 1.0*