Statistical Methods for Integrative Analysis of Large-Scale Multi-Ethnic Whole Genome Sequencing Studies and Biobanks of Common Diseases

NIH RePORTER · NIH · R01 · $488,641

Abstract

This proposal aims to develop advanced and scalable statistical methods for integrative analysis of large-scale Whole Genome Sequencing (WGS) studies and biobanks of common diseases, such as heart and lung diseases. Genome-Wide Association Studies (GWAS) have revealed thousands of genetic variants associated with many common diseases, but are limited to common variants and draw predominantly on individuals of European ancestry. Large-scale multi-ethnic WGS studies and biobanks have emerged rapidly to overcome these limitations, and to study the genetic underpinnings of complex diseases and traits in both coding and non-coding rare variants across populations. Examples include the NHLBI Trans-Omics for Precision Medicine (TOPMed) Program, the NHGRI Genome Sequencing Program (GSP), UK Biobank, and All of Us. Various omics data are also available in TOPMed. Full use of these datasets can fuel genetic discoveries applicable to genetically understudied populations. These studies contain hundreds of millions of rare variants (RVs), and their analysis faces several challenges. First, although several methods have been developed for RV analysis, they have limited power for non-coding RVs, because the functions of these variants are unknown or cell-type specific. There is a pressing need to empower RV Association Tests (RVATs) for non-coding variants by developing more powerful statistical learning methods that use integrative analysis and incorporate cell-type-specific variant functional annotations. Second, the large sample sizes of WGS studies, together with the data privacy considerations of many national and institutional biobanks and their unbalanced case-control ratios, call for distributed WGS analyses.
Third, it is of substantial interest to develop polygenic risk scores using both common and rare variants in WGS studies, and to investigate the causal effects of biomarkers and omics markers on diseases through Mendelian Randomization (MR) with both common and rare variants as instrumental variables. This proposal addresses these needs with four aims. First, we will develop statistical-learning-based ensemble RVATs to boost power. This ensemble RVAT framework will be extended to use cell-type-specific functional annotations calculated from single-cell assays, and to perform meta-analysis. Second, we will develop distributed methods for important tasks in the analysis of large WGS and federated biobank data: estimating population structure via distributed fast principal component analysis, distributed methods for fitting generalized linear mixed models, and distributed RVATs. Third, we will develop methods for constructing polygenic risk scores (PRSs) using both common and rare variants in WGS studies, and develop Mendelian Randomization methods for studying the causal effects of biomarkers and omics markers on diseases by using WGS-based PRSs as instrumental variables. Fourth, we will develop open-access statistical software capable of implementing our proposed methods in both offli...

Key facts

NIH application ID: 10829876
Project number: 5R01HL163560-03
Recipient: HARVARD UNIVERSITY D/B/A HARVARD SCHOOL OF PUBLIC HEALTH
Principal Investigator: XIHONG LIN
Activity code: R01
Funding institute: NIH
Fiscal year: 2024
Award amount: $488,641
Award type: 5
Project period: 2022-05-15 → 2026-04-30