Feature Engineering to Infer Proteomic Changes from mRNA Data

NIH RePORTER · NIH · R21 · $435,406

Abstract

PROJECT SUMMARY

Proteins are the molecules that carry out the majority of biological functions. Although mRNA levels can be measured at scale and have been transformative for understanding gene expression in large cohorts, mRNA levels correlate only partially with protein levels in a system. As a result, some differentially expressed genes from transcriptomics experiments may not be informative about the abundance of their proteins. This leaves their functional significance difficult to interpret and leads to a loss of information that impedes the translation of 'omics experiments into biological knowledge. The recent availability of large matched transcriptomics and proteomics datasets has created new avenues to predict protein-level changes from mRNA profiles using machine learning methods. Results from these efforts have highlighted the prevalence of post-transcriptional regulation of the proteome: the abundance of a protein species in a sample is often determined not only by its own coding mRNA but also by the abundance of other mRNAs in the transcriptome, including many of those coding for its protein-protein interaction partners. Accordingly, this project aims to explore new strategies that capture protein-protein relationships to enhance our current capability to infer protein-level changes from mRNA abundance measurements. Specifically, Aim 1 will explore the use of conceptual embeddings of proteins to create low-dimensional vectors that capture relevant protein information on (1) the topology of protein-protein interaction networks measured in large mass spectrometry experiments, and (2) protein sequence, domain, and structure. It will then evaluate the utility of these vectors for capturing the relevant protein neighborhoods that aid in the prediction of proteomic changes from mRNA abundance.
In parallel, Aim 2 will disseminate these technological advances by building software tools and web apps that apply the pre-trained models to new user-supplied mRNA sequencing results. These tools are designed to assist in the prioritization and interpretation of gene lists from sequencing experiments. The models will be validated by mass spectrometry and immunoblot experiments. If successful, the proposed work will lead to broadly applicable software tools that can enhance the utility and interpretation of transcriptomics and proteomics experiments. It may also yield new insights into the biological factors that contribute to the lack of correlation between mRNA and protein levels.

Key facts

NIH application ID: 10949467
Project number: 1R21HG013684-01
Recipient: UNIVERSITY OF COLORADO DENVER
Principal Investigator: Edward Lau
Activity code: R21
Funding institute: NIH
Fiscal year: 2024
Award amount: $435,406
Award type: 1
Project period: 2024-09-23 → 2026-08-31