Mining Thousands of Genomes to Classify Somatic and Pathogenic Structural Variants

NIH RePORTER · NIH · R01 · $578,204

Abstract

Project Summary

Structural variants (SVs) have been associated with a wide range of cancers and Mendelian disorders, but the difficulty of interpreting them has slowed their adoption. It is still a challenge to determine which SVs observed in a cancer patient are somatic and which SVs in a rare disease patient are pathogenic. The SV interpretation gap is especially stark when compared to the recent progress made with single nucleotide variants (SNVs), which was driven by the release of large-scale population allele frequency estimates from gnomAD. Given that variants that lead to cancer and rare disease should be rare in the general population, the SNV allele frequency from 125,000 samples is an extremely powerful metric. Allele frequency alone can reduce the number of potentially pathogenic variants by two orders of magnitude. Unfortunately, there is no equivalent resource for SVs. There are high-quality SV call sets (SV VCFs) from large cohorts, but these static lists do not make good allele frequency references. SV detection involves extensive filtering to reduce false positives, and because filtering is never perfect, real SVs are inevitably removed. This makes it difficult to draw a conclusion about an SV that is found in a patient but not in the VCF: the SV could be rare and absent from the population, or it could have been filtered out.

We propose a new method (STIX) for SV characterization that dynamically searches the raw alignments from thousands of genomes for evidence supporting a putative SV. From such a search we can conclude that an SV with high-level evidence in many samples is likely to be a common variant and unlikely to be somatic or pathogenic. With this method we show that many published somatic and de novo SVs are actually present in reference populations, which implies that these variants are unlikely to cause disease. In fact, STIX is as effective as using calls from a matched-normal sample at removing germline SVs from tumor tissue calls. We also show that by relying on the raw signal, STIX recovers substantially more SVs from a cohort than its corresponding SV VCF.

In addition to large-scale SV searching, we propose a robust statistical framework for estimating SV allele frequency and regional noise. We plan to make the searching technology and statistics freely available for nearly 30,000 genomes through a public web interface and integration with AnVIL. If funded, this project will provide the means to accurately estimate SV population frequency by leveraging the data in tens of thousands of genomes, which will greatly increase our ability to prioritize SVs in patients and pave the way toward broader inclusion of SVs in medical genetics.
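
The filtering idea described above (an SV with supporting evidence in many reference samples is unlikely to be somatic or pathogenic) can be illustrated with a minimal Python sketch. This is not STIX's interface or its statistical framework. The `SampleEvidence` record, the `carrier_fraction` and `likely_germline` helpers, and the `min_reads` and `max_fraction` thresholds are all hypothetical names and values chosen for illustration. The sketch assumes that a search over raw alignments has already produced a count of supporting reads per sample for one putative SV.

```python
from dataclasses import dataclass


@dataclass
class SampleEvidence:
    """Hypothetical per-sample search result for one putative SV."""
    sample_id: str
    supporting_reads: int  # reads in this sample's alignments that support the SV


def carrier_fraction(hits, min_reads=2):
    """Fraction of cohort samples with at least `min_reads` supporting reads.

    `min_reads` is an assumed evidence threshold, not a value from the source.
    """
    if not hits:
        return 0.0
    carriers = sum(1 for h in hits if h.supporting_reads >= min_reads)
    return carriers / len(hits)


def likely_germline(hits, min_reads=2, max_fraction=0.01):
    """Flag an SV as likely common (germline) when the fraction of
    supporting samples exceeds `max_fraction` (an assumed cutoff)."""
    return carrier_fraction(hits, min_reads) > max_fraction


# Toy cohort: two of four samples show support for the SV.
cohort = [
    SampleEvidence("s1", 0),
    SampleEvidence("s2", 5),
    SampleEvidence("s3", 3),
    SampleEvidence("s4", 1),
]

print(carrier_fraction(cohort))
print(likely_germline(cohort))
```

The fraction of samples carrying evidence is only a crude proxy for allele frequency: it ignores zygosity, sequencing depth, and regional noise, which the proposed statistical framework is meant to address.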

Key facts

NIH application ID: 10888229
Project number: 5R01HG011774-03
Recipient: UNIVERSITY OF COLORADO
Principal Investigator: Ryan M Layer
Activity code: R01
Funding institute: NIH
Fiscal year: 2024
Award amount: $578,204
Award type: 5
Project period: 2022-09-23 → 2027-06-30