# Leveraging biobank-scale whole-genome sequencing for polygenic risk prediction

> **NIH R01** · BRIGHAM AND WOMEN'S HOSPITAL · 2024 · $438,550

## Abstract

Whole-genome sequencing of population biobank cohorts holds great promise for enabling accurate prediction
of genetically mediated risk for heritable human diseases and traits. Such information has the potential to be a
powerful resource for precision medicine, informing preventative and therapeutic decisions. To more fully
realize this potential, new statistical methods are needed to incorporate all genetic variants – including
structural variants, blood-derived somatic mutations, and rare SNPs and indels – into genetic risk models.
These classes of genetic variation, which are known to include many variants with large effects on disease
risk, can be detected in high-coverage whole-genome sequencing data now being generated at biobank scale.
However, such variants have not been accessible from previous genetic data sets (which have relied on
SNP-array genotyping and imputation). Consequently, existing methods for polygenic prediction have typically
considered only common inherited SNPs and indels.
We propose to develop a suite of statistical methods to enable these additional classes of genetic variants to
be incorporated into models of genetic risk, thereby improving predictive power. For variant types that are
currently difficult to ascertain even from whole-genome sequencing data – including somatic mutations and
some types of structural variants – we will develop new genotyping algorithms that improve statistical inference
by harnessing information across large sequenced cohorts. We will efficiently integrate information across all
variant types into genetic risk models using fast Bayesian regression methods. We will apply these approaches
to train genetic risk models for common diseases using data from very large biobank sequencing projects.
This project will have three specific aims. First, we will develop and apply methods for incorporating structural
variants into polygenic scores. Many structural variants are known to confer substantial disease risk but are
imperfectly modeled by existing polygenic scores, such that directly including such variants will increase
prediction accuracy and cross-ancestry transferability. Second, we will develop and apply methods for
incorporating somatic mutations detectable in blood-derived DNA into genetic risk models. Such acquired
mutations, often indicative of clonal expansions of blood cells, provide an orthogonal source of risk compared
to the inherited variants considered by standard polygenic scores. Third, we will develop and apply efficient
computational methods for training polygenic score models on biobank-scale sequencing data. These methods
will allow model-fitting to be performed on individual-level genetic data, optimizing prediction accuracy. We
anticipate that these efforts will significantly improve performance of genetic risk models trained on current and
future population-scale whole-genome sequencing data sets.

## Key facts

- **NIH application ID:** 10930983
- **Project number:** 5R01HG013110-02
- **Recipient organization:** BRIGHAM AND WOMEN'S HOSPITAL
- **Principal Investigator:** Po-Ru Loh
- **Activity code:** R01
- **Funding institute:** NIH
- **Fiscal year:** 2024
- **Award amount:** $438,550
- **Award type:** 5
- **Project period:** 2023-09-18 → 2027-07-31

## Primary source

NIH RePORTER: https://reporter.nih.gov/project-details/10930983

## Citation

> US National Institutes of Health, RePORTER application 10930983, Leveraging biobank-scale whole-genome sequencing for polygenic risk prediction (5R01HG013110-02). Retrieved via AI Analytics 2026-05-24 from https://api.ai-analytics.org/grant/nih/10930983. Licensed CC0.

---

*[NIH grants dataset](/datasets/nih-grants) · CC0 1.0*