Integrating the impacts of genetic variation with massively parallel mRNA and protein barcoding

NIH RePORTER · NIH · R56 · $636,648

Abstract

The human genome supports billions of potential variants, and the number of variants detected through genome sequencing continues to grow rapidly. However, our ability to interpret these variants and identify those responsible for disease phenotypes still lags. Statistical methods such as genome-wide association studies struggle with rare or de novo mutations. Approaches based on sequence conservation are less reliable for the vast majority of variants that fall outside protein-coding sequences. These shortcomings of traditional approaches motivate the development of functional models for variant stratification. Functional models aim to quantify how a genetic variant impacts a molecular phenotype (transcription, splicing, polyadenylation, stability, translation, and more), thus creating a link between a variant and an organismal phenotype. Crucially, such sequence-to-function models can generalize from training data to unseen sequences by learning the regulatory rules underlying the observed molecular phenotype, making it possible to predict the impact of even rare variants on the process under investigation. In prior work, we showed that models learned from massive numbers of synthetic reporter constructs could strongly outperform models learned from the comparatively small number of natural examples, even in predicting the impact of variants on gene expression in humans. Here, we extend our work combining synthetic biology with machine learning to build and integrate predictive models of mRNA stability and translation.

In Specific Aim 1, we propose to conduct high-throughput reporter assays to characterize how variants in both coding and untranslated regions impact mRNA stability and translation. Regulatory rules are often position-dependent, and a particular sequence motif might have a vastly different impact on gene expression depending on its location. We thus propose to develop massively parallel reporter assays (MPRAs) that interrogate variants in the coding sequence (CDS) as well as in the 5′ and 3′ untranslated regions (UTRs). Furthermore, we propose to perform stability and translation MPRAs in multiple cell types to uncover aspects of cell type-specific regulation.

In Specific Aim 2, we propose to develop machine learning approaches that enable us to integrate results from multiple measurements and generalize predictive rules learned from such assays. Additionally, we will apply a deep learning model interpretation method developed in our lab to generate hypotheses about biological mechanisms, which we will test via MPRAs in knockdown cell lines.

In Specific Aim 3, we propose to develop a novel type of reporter assay that allows us to directly measure the impact of sequence variation on protein levels using high-throughput nanopore sequencing of protein-level barcodes. By going beyond ribosome loading measurements, this protein barcoding technology will improve the resolution, specificity, and accuracy of protein measurements and reduce the potential for false positives and false negatives.

Key facts

NIH application ID: 11194039
Project number: 1R56HG013312-01A1
Recipient: UNIVERSITY OF WASHINGTON
Principal Investigator: Jeffrey Matthew Nivala
Activity code: R56
Funding institute: NIH
Fiscal year: 2024
Award amount: $636,648
Award type: 1
Project period: 2024-09-25 → 2025-08-31