# Predictive Modeling of Alternative Splicing and Polyadenylation from Millions of Random Sequences

> **NIH R01** · UNIVERSITY OF WASHINGTON · 2020 · $573,285

## Abstract

The proportion of the human genome that underlies gene regulation dwarfs the proportion that encodes
proteins. However, we remain poorly equipped for identifying which genetic variants compromise gene
regulatory function in ways that may contribute to risk for both rare and common human diseases.
Understanding how non-coding sequences regulate gene expression, as well as being able to predict the
functional consequences of genetic variation for gene regulation, are paramount challenges for the field. Here,
we propose to combine synthetic biology, massively parallel functional assays, and machine learning to
profoundly advance our understanding of the “regulatory code” of the human genome. While challenging, the
task of unravelling complex codes from large amounts of empirical data is not without precedent. For example,
over the past decade, computer scientists working in natural language processing have made immense
progress, driven in large part by a combination of algorithmic and computational improvements and
far larger training datasets than were available to previous generations of scientists working in this
area. Inspired by the revolutionary impact of “big data” on traditional problems in machine learning, we
propose to model gene regulatory phenomena using training datasets with several orders of magnitude more
examples than naturally exist in the human genome. We predict that the models learned from massive
numbers of synthetic examples will strongly outperform models learned from the small number of natural
examples. We will demonstrate our approach by developing comprehensive, quantitative, and predictive
models for alternative splicing and alternative polyadenylation, two widespread regulatory mechanisms by
which a single gene can code for multiple transcripts and proteins. However, we anticipate that this basic
paradigm – specifically, the massively parallel measurement of the functional behavior of extremely large
numbers of synthetic sequences followed by quantitative modeling of sequence-function relationships – can be
generalized to advance our understanding of diverse forms of gene regulation.

## Key facts

- **NIH application ID:** 9869019
- **Project number:** 5R01HG009136-04
- **Recipient organization:** UNIVERSITY OF WASHINGTON
- **Principal Investigator:** Georg Seelig
- **Activity code:** R01
- **Funding institute:** NIH
- **Fiscal year:** 2020
- **Award amount:** $573,285
- **Award type:** 5
- **Project period:** 2017-04-21 → 2022-01-31

## Primary source

NIH RePORTER: https://reporter.nih.gov/project-details/9869019

## Citation

> US National Institutes of Health, RePORTER application 9869019, Predictive Modeling of Alternative Splicing and Polyadenylation from Millions of Random Sequences (5R01HG009136-04). Retrieved via AI Analytics 2026-05-21 from https://api.ai-analytics.org/grant/nih/9869019. Licensed CC0.

---

*[NIH grants dataset](/datasets/nih-grants) · CC0 1.0*