# Mining Thousands of Genomes to Classify Somatic and Pathogenic Structural Variants

> **NIH R01** · UNIVERSITY OF COLORADO · 2022 · $578,543

## Abstract

Structural variants (SVs) have been associated with a wide range of cancers and Mendelian disorders, but
complexities associated with interpretation have slowed their adoption. It is still a challenge to determine which
SVs observed in a cancer patient are somatic and which SVs in a rare disease patient are pathogenic. The
SV interpretation gap is especially stark when compared to the recent progress made with single nucleotide
variants (SNVs), which was driven by the release of large-scale population allele frequency estimates from
gnomAD. Given that variants that lead to cancer and rare disease should be rare in the general population, the
SNV allele frequency from 125 thousand samples is an extremely powerful metric. Allele frequency alone can
reduce the number of potentially pathogenic variants by two orders of magnitude. Unfortunately, there is no
equivalent resource for SVs.
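
The allele-frequency filter described above can be sketched in a few lines. This is only an illustration of the idea, not code from the project: the `Variant` record, the `filter_rare` helper, the example frequencies, and the 0.001 cutoff are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Variant:
    name: str             # hypothetical identifier, for illustration only
    population_af: float  # allele frequency in a reference population


def filter_rare(variants, max_af=0.001):
    """Keep variants whose population allele frequency is at most max_af.

    The 0.001 default is an assumed cutoff, not a value from the source.
    """
    return [v for v in variants if v.population_af <= max_af]


# Made-up candidates: two common variants and two rare ones.
candidates = [
    Variant("var_a", 0.25),
    Variant("var_b", 0.0),
    Variant("var_c", 0.0004),
    Variant("var_d", 0.03),
]

rare = filter_rare(candidates)
print([v.name for v in rare])
```

Variants that are common in the reference population are dropped, leaving only the rare candidates for further review.
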

There are high-quality SV call sets (SV VCFs) from large cohorts, but these static lists do not make
good allele frequency references. SV detection involves extensive filtering to reduce false positives, and
because filtering is never perfect, real SVs are inevitably removed, making it difficult to draw a conclusion about
SVs that are in patients but not in the VCF. The SV could be rare and absent from the population or could have
been filtered.

We propose a new method (STIX) for SV characterization that dynamically searches the raw
alignments from thousands of genomes for evidence supporting a putative SV. From such a search we can
conclude that an SV with high-level evidence in many samples is likely to be a common variant and unlikely to
be somatic or pathogenic. With this method we show that many published somatic and de novo SVs are
actually present in reference populations, which implies that these variants are unlikely to cause disease. In
fact, STIX is as effective as using calls from a matched-normal sample at removing germline SVs from tumor
tissue calls. We also show that by relying on the raw signal, STIX recovers substantially more SVs from a
cohort than its corresponding SV VCF.
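
The reasoning in this paragraph, that an SV with supporting evidence in many samples is probably common, can be illustrated with a minimal sketch. It does not reflect STIX's actual interface or scoring: the per-sample read counts, the function names, and both thresholds (`min_reads`, `max_fraction`) are assumptions made up for this example.

```python
def carrier_fraction(evidence_by_sample, min_reads=2):
    """Fraction of samples with at least min_reads supporting reads.

    evidence_by_sample maps a sample name to a count of supporting reads
    (a hypothetical stand-in for the result of a raw-alignment search).
    """
    if not evidence_by_sample:
        raise ValueError("need at least one sample")
    carriers = sum(1 for reads in evidence_by_sample.values() if reads >= min_reads)
    return carriers / len(evidence_by_sample)


def classify_sv(evidence_by_sample, min_reads=2, max_fraction=0.01):
    """Label an SV by how widely it is supported in the reference cohort."""
    fraction = carrier_fraction(evidence_by_sample, min_reads)
    if fraction > max_fraction:
        return "likely common"
    return "candidate rare"


# Made-up cohorts of 100 samples each.
common_sv = {f"s{i}": (5 if i < 40 else 0) for i in range(100)}  # 40 well-supported samples
rare_sv = {f"s{i}": (1 if i == 0 else 0) for i in range(100)}    # one weakly supported sample

print(classify_sv(common_sv))
print(classify_sv(rare_sv))
```

An SV labeled "likely common" would be deprioritized as a somatic or pathogenic candidate, while a "candidate rare" label only means the cohort shows little support and the variant still needs review.
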

In addition to large-scale SV searching, we propose a robust statistical framework for estimating SV
allele frequency and regional noise. We plan to make the searching technology and statistics freely available for
nearly 30,000 genomes through a public web interface and integration with AnVIL. If funded, this project will
provide the means to accurately estimate SV population frequency by leveraging the data in tens of thousands
of genomes, which will greatly increase our ability to prioritize SVs in patients and pave the way toward
broader inclusion of SVs in medical genetics.

## Key facts

- **NIH application ID:** 10453323
- **Project number:** 1R01HG011774-01A1
- **Recipient organization:** UNIVERSITY OF COLORADO
- **Principal Investigator:** Ryan M Layer
- **Activity code:** R01
- **Funding institute:** NIH
- **Fiscal year:** 2022
- **Award amount:** $578,543
- **Award type:** 1
- **Project period:** 2022-09-23 → 2027-06-30

## Primary source

NIH RePORTER: https://reporter.nih.gov/project-details/10453323

## Citation

> US National Institutes of Health, RePORTER application 10453323, Mining Thousands of Genomes to Classify Somatic and Pathogenic Structural Variants (1R01HG011774-01A1). Retrieved via AI Analytics 2026-05-26 from https://api.ai-analytics.org/grant/nih/10453323. Licensed CC0.

---

*[NIH grants dataset](/datasets/nih-grants) · CC0 1.0*