Modeling gene expression in yeast using large degenerate libraries

NIH RePORTER · NIH · R01 · $347,304

Abstract

Short sequence elements in DNA and RNA determine the levels and composition of mRNAs and proteins, making it critical that we can accurately model how any given sequence will affect transcription, splicing or translation. Such models of cis-regulation will fill in gaps in our knowledge of these core gene expression processes. Additionally, as large numbers of human genomes are sequenced, the ability to predict the effects of sequence variation on the ultimate levels of proteins will be integral to the interpretation of variation in regulatory sequences. Similarly, the construction of metabolic pathways with defined levels of expression and the engineering of synthetic gene networks require accurate knowledge of how regulatory sequences affect expression. This application seeks to use the yeast Saccharomyces cerevisiae as a test case for learning how any short regulatory sequence affects protein levels. A predictive model will be trained on a set of libraries two orders of magnitude more complex than any characterized to date. Libraries of a growth reporter gene will be generated with a million random sequences of 50 nucleotides that comprise either a DNA element that regulates transcription or an RNA element that regulates splicing or translation. The libraries will be transformed into yeast, and the yeast will be placed under selection such that they grow according to the ability of each random sequence to contribute to protein expression. A convolutional neural network approach will be used to learn the relationship between these “fitness” phenotypes and their associated genotypes. Although yeast is a single-celled eukaryote, it has been the source of most of the original findings on gene expression, and these findings form the basis for much of our knowledge of more complex eukaryotes.
Furthermore, the short sequences in yeast that comprise the DNA- and RNA-binding sites of regulatory proteins tend to be comparable in size to those of other organisms. Yeast is used often in synthetic biology and metabolic engineering, and the work proposed here will result in novel tools for quantitatively controlling its gene expression. Initial results with a library of 5' untranslated regions (UTRs) indicate that we can construct a model to account for a large fraction of the observed variability in expression, and that the model extends to native sequence elements. The model allowed us to forward engineer 5' UTRs to have increased activity. Specific aims of this application are to assess the effects of random sequences targeted to upstream regulatory elements, core promoter elements, 5' UTRs, introns and 3' UTRs; to learn predictive and interpretable models using convolutional neural networks and to identify novel functional cis-regulatory elements; and to validate our models on native sequences and combinatorial libraries, and by engineering synthetic sequence elements with user-specified properties. In sum, the proposal seeks to const...
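
The modeling idea in the abstract (random 50-nucleotide sequences scored by a convolutional model) can be illustrated with a minimal sketch. Everything below is an assumption made for illustration and is not taken from the project: the function names, the library size of 1,000, the example motif, and the single hand-written filter. A trained convolutional neural network would learn many such filters from the measured fitness data; this sketch only shows the one-hot encoding and the sliding-filter operation that such a network builds on.

```python
import random

ALPHABET = "ACGT"
SEQ_LEN = 50  # length of each random element, as stated in the abstract


def random_library(n, length=SEQ_LEN, seed=0):
    """Generate n random DNA sequences (a toy stand-in for a degenerate library)."""
    rng = random.Random(seed)
    return ["".join(rng.choice(ALPHABET) for _ in range(length)) for _ in range(n)]


def one_hot(seq):
    """Encode a DNA sequence as a list of 4-element rows in A, C, G, T order."""
    return [[1.0 if base == letter else 0.0 for letter in ALPHABET] for base in seq]


def conv1d_max(encoded, kernel):
    """Slide one filter along the sequence and return the maximum ReLU activation."""
    width = len(kernel)
    best = 0.0  # starting at 0 applies the ReLU before max-pooling
    for start in range(len(encoded) - width + 1):
        score = sum(
            encoded[start + i][j] * kernel[i][j]
            for i in range(width)
            for j in range(len(ALPHABET))
        )
        best = max(best, score)
    return best


# Hypothetical, hand-written filter that scores matches to the motif "ATG".
# It is chosen only to make the arithmetic easy to follow.
motif_filter = one_hot("ATG")

library = random_library(1000)  # toy size; the abstract describes a million
features = [conv1d_max(one_hot(seq), motif_filter) for seq in library]
```

Each entry of `features` is the best match of the filter anywhere in the corresponding sequence, ranging from 0 (no position matches) to 3 (an exact occurrence of the motif). In a full model, many learned filters of this kind would feed further layers that predict the fitness measured for each sequence.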

Key facts

NIH application ID: 9929602
Project number: 5R01GM125809-03
Recipient: UNIVERSITY OF WASHINGTON
Principal Investigator: STANLEY FIELDS
Activity code: R01
Funding institute: NIH
Fiscal year: 2020
Award amount: $347,304
Award type: 5
Project period: 2018-08-01 → 2022-05-31