Predictive Modeling of Alternative Splicing and Polyadenylation from Millions of Random Sequences

NIH RePORTER · NIH · R01 · $573,285 · view on reporter.nih.gov ↗

Abstract

The proportion of the human genome that underlies gene regulation dwarfs the proportion that encodes proteins. However, we remain poorly equipped for identifying which genetic variants compromise gene regulatory function in ways that may contribute to risk for both rare and common human diseases. Understanding how non-coding sequences regulate gene expression, as well as being able to predict the functional consequences of genetic variation for gene regulation, are paramount challenges for the field. Here, we propose to combine synthetic biology, massively parallel functional assays, and machine learning to profoundly advance our understanding of the `regulatory code' of the human genome. While challenging, the task of unravelling complex codes from large amounts of empirical data is not without precedent. For example, over the past decade, computer scientists working in natural language processing have made immense progress, driven in large part by a combination of algorithmic and computational improvements and enormously larger training datasets than were available to the previous generations of scientists working in this area. Inspired by the revolutionizing impact of “big data” for traditional problems in machine learning, we propose to model gene regulatory phenomena using training datasets with several orders of magnitude more examples than naturally exist in the human genome. We predict that the models learned from massive numbers of synthetic examples will strongly outperform models learned from the small number of natural examples. We will demonstrate our approach by developing comprehensive, quantitative, and predictive models for alternative splicing and alternative polyadenylation, two widespread regulatory mechanisms by which a single gene can code for multiple transcripts and proteins. However, we anticipate that this basic paradigm – specifically, the massively parallel measurement of the functional behavior of extremely large numbers of synthetic sequences followed by quantitative modeling of sequence-function relationships – can be generalized to advance our understanding of diverse forms of gene regulation.

Key facts

NIH application ID
9869019
Project number
5R01HG009136-04
Recipient
UNIVERSITY OF WASHINGTON
Principal Investigator
Georg Seelig
Activity code
R01
Funding institute
NIH
Fiscal year
2020
Award amount
$573,285
Award type
5
Project period
2017-04-21 → 2022-01-31