Leveraging biobank-scale whole-genome sequencing for polygenic risk prediction

NIH RePORTER · NIH · R01 · $438,550

Abstract

Whole-genome sequencing of population biobank cohorts holds great promise for enabling accurate prediction of genetically mediated risk for heritable human diseases and traits. Such information has the potential to be a powerful resource for precision medicine, informing preventative and therapeutic decisions. To more fully realize this potential, new statistical methods are needed to incorporate all genetic variants – including structural variants, blood-derived somatic mutations, and rare SNPs and indels – into genetic risk models. These classes of genetic variation, which are known to include many variants with large effects on disease risk, can be detected in high-coverage whole-genome sequencing data now being generated at biobank scale. However, such variants have not been accessible from previous genetic data sets (which have relied on SNP-array genotyping and imputation). Consequently, existing methods for polygenic prediction have typically considered only common inherited SNPs and indels. We propose to develop a suite of statistical methods to enable these additional classes of genetic variants to be incorporated into models of genetic risk, thereby improving predictive power. For variant types that are currently difficult to ascertain even from whole-genome sequencing data – including somatic mutations and some types of structural variants – we will develop new genotyping algorithms that improve statistical inference by harnessing information across large sequenced cohorts. We will efficiently integrate information across all variant types into genetic risk models using fast Bayesian regression methods. We will apply these approaches to train genetic risk models for common diseases using data from very large biobank sequencing projects.

This project will have three specific aims. First, we will develop and apply methods for incorporating structural variants into polygenic scores. Many structural variants are known to confer substantial disease risk but are imperfectly modeled by existing polygenic scores, such that directly including such variants will increase prediction accuracy and cross-ancestry transferability. Second, we will develop and apply methods for incorporating somatic mutations detectable in blood-derived DNA into genetic risk models. Such acquired mutations, often indicative of clonal expansions of blood cells, provide an orthogonal source of risk compared to the inherited variants considered by standard polygenic scores. Third, we will develop and apply efficient computational methods for training polygenic score models on biobank-scale sequencing data. These methods will allow model-fitting to be performed on individual-level genetic data, optimizing prediction accuracy. We anticipate that these efforts will significantly improve performance of genetic risk models trained on current and future population-scale whole-genome sequencing data sets.

Key facts

NIH application ID: 10930983
Project number: 5R01HG013110-02
Recipient: BRIGHAM AND WOMEN'S HOSPITAL
Principal Investigator: Po-Ru Loh
Activity code: R01
Funding institute: NIH
Fiscal year: 2024
Award amount: $438,550
Award type: 5
Project period: 2023-09-18 → 2027-07-31