Cross-platform structural variant discovery with deep learning

NIH RePORTER · NIH · R01 · $593,860

Abstract

Structural variants (SVs) are a major driver of genetic diversity and disease in the human genome, and their discovery is imperative to advances in precision medicine and our understanding of human genetics. Due to revolutionary breakthroughs in whole-genome sequencing technologies, we now have access to genomic data at an unprecedented scale and resolution. However, despite tremendous effort and progress in SV calling methodology, general SV discovery remains unsolved. Existing techniques use hand-engineered features and heuristics to model SV classes and rely heavily on developer expertise. This approach cannot scale to the vast diversity of SV types and sequencing platforms, nor can it fully harness all the information available in raw sequencing data. As a result, these methods are usually tightly coupled to the properties of a particular sequencing technology and operate optimally only on certain SV types and sizes, rendering us blind to many other classes of SVs and their role in disease. Deep neural networks can learn complex abstractions automatically from the data and hence offer a promising avenue for general SV discovery. Deep learning has recently transformed the field of machine learning and led to remarkable advances in science and medicine. In this proposal we aim to leverage the potential of deep learning for the problem of SV detection. We lay out how to efficiently formulate SV detection as a deep learning task, and propose the development of a comprehensive framework to call and genotype SVs of different sizes and types, including complex and subclonal SVs, given data from a range of sequencing platforms. In particular, we demonstrate that state-of-the-art results can be obtained using our approach for short-read, linked-read, and long-read datasets.
To ensure that our models generalize across different datasets, an important goal of our proposal is also to assemble diverse and representative training data and to perform extensive evaluation on publicly available multi-platform datasets, so that model performance is assessed accurately. Our software will be built with extensibility and scalability in mind, and will be released freely to the community, along with pretrained models and callsets.

Key facts

NIH application ID: 10453237
Project number: 1R01HG012467-01
Recipient: BROAD INSTITUTE, INC.
Principal Investigator: Victoria Popic
Activity code: R01
Funding institute: NIH
Fiscal year: 2022
Award amount: $593,860
Award type: 1
Project period: 2022-09-01 → 2027-06-30