# Statistical Analysis of Large Genomic Data Sets

> **NIH R01** · CASE WESTERN RESERVE UNIVERSITY · 2024 · $466,949

## Abstract

Recent large genome-wide association studies (GWAS) have identified hundreds to thousands of genetic
variants associated with complex traits. The resulting GWAS summary statistics, together with large Biobank
data, provide an unprecedented opportunity to understand the genetic mechanisms of complex traits. Inferring
causal effects of risk factors on disease is the major challenge of observational epidemiology studies, which now
can be addressed using genomic data through the cost-efficient Mendelian Randomization (MR) approach.
However, current MR approaches suffer from bias from multiple sources, including weak instrumental variables
(IVs), sample overlap, horizontal pleiotropy, and linkage disequilibrium (LD) among IVs. Novel statistical
methods that can infer causality and estimate causal effects without bias are therefore needed. On the other hand,
one of the dominant views in the field is that genetic variation of complex disease is largely explained by additive
effects. Even though gene-environment and gene-gene interactions have been well documented in experimental
studies, the contribution of interactions is still unclear, partially because of limitations of current analytic
approaches. Current methodological development focuses on improving computational efficiency to
overcome the burden of the large number of interaction tests at the genome-wide level, but the fundamental
method is based on standard linear regression, which often has low statistical power. In this project, we will develop
(1) novel unbiased multivariable MR methods with application to large genomic and Biobank data, (2) novel powerful
gene-environment interaction (𝐺 × 𝐸) methods with application to large genomic and Biobank data, (3) a novel
powerful gene-gene interaction (𝐺 × 𝐺) method with application to large genomic and Biobank data, and (4)
corresponding software that will be made publicly available. We will apply these methods and software to UK
Biobank, TOPMed WGS, and All of Us, as well as many existing GWAS summary statistics datasets. We request
support to develop statistical methods and software to address these goals. The proposed novel multivariable
MR methods and 𝐺 × 𝐸 and 𝐺 × 𝐺 methods would speed up new discoveries and improve our understanding
of the genetic architecture of complex traits, which aligns with the National Human Genome Research Institute
mission.

## Key facts

- **NIH application ID:** 10876725
- **Project number:** 2R01HG011052-05
- **Recipient organization:** CASE WESTERN RESERVE UNIVERSITY
- **Principal Investigator:** XIAOFENG ZHU
- **Activity code:** R01
- **Funding institute:** NIH
- **Fiscal year:** 2024
- **Award amount:** $466,949
- **Award type:** 2
- **Project period:** 2020-05-08 → 2028-04-30

## Primary source

NIH RePORTER: https://reporter.nih.gov/project-details/10876725

## Citation

> US National Institutes of Health, RePORTER application 10876725, Statistical Analysis of Large Genomic Data Sets (2R01HG011052-05). Retrieved via AI Analytics 2026-05-24 from https://api.ai-analytics.org/grant/nih/10876725. Licensed CC0.

---

*[NIH grants dataset](/datasets/nih-grants) · CC0 1.0*