Statistical methods for higher order dependences to understand protein functions

NIH RePORTER · NIH · R01 · $230,993

Abstract

This proposal brings together a strong team from molecular science and statistics to tackle the important problem of how to integrate protein structure and sequence information in complex systems. Among the most important characteristics of these data are the strong correlations buried within them; pairwise correlations in sequence data are already routinely used to predict structural contacts. Here, we are developing novel ways to extract higher-order dependences from huge data sets, which is now possible given the large volumes of sequence data available from genomics. In addition, such higher-order dependences are directly observable in protein structures, where groups of amino acids interact directly. Importantly, these higher-order dependences reflect the dense physical environment in the cell that requires proper statistical characterization. A new model-free information-theoretic measure is introduced to quantify the higher-order dependences and serves as the central method in this project. By identifying the major challenges in drawing statistical inference based on this measure, we develop, evaluate, and improve a new statistical inference and computational framework for analyzing higher-order dependences in discrete data of a general type, motivated by protein multiple sequence data. The new computationally efficient framework makes it possible to discover reliable higher-order dependences while quantifying uncertainty. The preliminary data combine information from sequences and structures to yield unexpected results that relate directly to the dynamics of the protein structures. The outcome is an entirely new approach to handling the large volumes of protein sequence data and other omics data now available, as well as the enormous volumes about to arrive on the doorsteps of omics analysts.
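To make the contrast between pairwise and higher-order dependence concrete, here is a minimal Python sketch. It is not the measure proposed in this project, which the abstract does not specify. As a stand-in it uses two standard information-theoretic quantities: mutual information for pairs of alignment columns and total correlation (multi-information) for a group of columns. The toy alignment, the two-letter alphabet, and all function names are hypothetical, and the plug-in entropy estimator shown has no bias correction or uncertainty quantification.

```python
from collections import Counter
from math import log2


def entropy(symbols):
    """Plug-in Shannon entropy, in bits, of a list of hashable symbols."""
    counts = Counter(symbols)
    n = sum(counts.values())
    return -sum((c / n) * log2(c / n) for c in counts.values())


def column(msa, j):
    """Return column j of an alignment given as equal-length strings."""
    return [seq[j] for seq in msa]


def mutual_information(msa, i, j):
    """Pairwise dependence between columns i and j: H(i) + H(j) - H(i, j)."""
    joint = list(zip(column(msa, i), column(msa, j)))
    return entropy(column(msa, i)) + entropy(column(msa, j)) - entropy(joint)


def total_correlation(msa, cols):
    """Joint dependence among several columns: sum of H(col) minus H(all cols)."""
    joint = list(zip(*(column(msa, j) for j in cols)))
    return sum(entropy(column(msa, j)) for j in cols) - entropy(joint)


# Hypothetical toy alignment: the third residue is fixed by the first two
# together (an XOR-like pattern), but by neither of them alone.
toy_msa = ["AAA", "ALL", "LAL", "LLA"]

pairwise = {
    (i, j): mutual_information(toy_msa, i, j)
    for i in range(3)
    for j in range(i + 1, 3)
}
three_way = total_correlation(toy_msa, [0, 1, 2])
```

In this constructed example every pairwise mutual information is zero while the three-column total correlation is one bit, so a purely pairwise analysis would miss the dependence entirely. Real alignments have 20 amino acids plus gaps and far more columns, so the number of joint states grows quickly with group size; that is where estimation and uncertainty quantification become the hard part.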

Key facts

NIH application ID: 10378307
Project number: 1R01GM144961-01
Recipient: COLORADO STATE UNIVERSITY
Principal Investigator: Wen Zhou
Activity code: R01
Funding institute: NIH
Fiscal year: 2021
Award amount: $230,993
Award type: 1
Project period: 2021-09-23 → 2024-08-31