# Advanced correlation analyses to infer sequence and structural determinants of protein function

> **NIH R01** · UNIVERSITY OF MARYLAND BALTIMORE · 2020 · $309,000

## Abstract

A long-term goal of molecular biology is assigning functional and mechanistic roles to specific protein residues,
beyond the obvious roles in catalysis. Although this task is hindered by the relative sparsity of
experimentally-based sequence annotations, it is facilitated by an abundance of sequence data augmented by structural data.
This has spurred sequence- and structure-based prediction of function determining residues using a wide
variety of methods. However, by focusing on experimentally characterized functions, these methods disfavor
recognition of residues involved in important uncharacterized functions, insofar as these will be benchmarked
incorrectly as false positives. Instead, this project focuses more generally on inferring functionally-relevant
residues (FRRs) by allowing the sequence data itself to reveal its most statistically surprising properties
without making assumptions about what will be found. We argue that, in the absence of experimental
annotations, it is only possible to directly link individual residues to other residues and such residue sets to
structural features. This project will make such associations by identifying sequence-to-sequence and
sequence-to-structure correlations, and will focus solely on the observed data rather than on predicting
(unseen) biochemical properties. The goal is to obtain hypothesis-generating observations for experimental
follow-up. Aim 1 will create advanced tools for characterizing correlated residue patterns due to functional
divergence, with each pattern consisting of an arbitrary number of residues. Aim 2 will develop a tool to
probabilistically assess correlations between independent sequence- and structurally-defined residue sets.
This tool will be modified for other purposes, including the evaluation of FRR-prediction programs. Aim 3 will
integrate Aims 1 & 2 methods and direct coupling analysis (DCA) into a nearly comprehensive system for
sequence/structural correlation analysis. (Unlike the correlations under Aims 1 & 2, DCA focuses on direct
correlations between residue pairs.) This strategy involves a high degree of model complexity and optimization
over diverse sequence properties synergistically (due to interrelationships and dependencies) and over
alternative models and parameters; hence, considerable care is required to ensure reliable results. Therefore,
we will apply information theoretical principles to adjust accurately for multiple hypotheses, to avoid under- and
over-fitting to the data, and to eliminate inherent biases. Aim 3 will also characterize the relationships among
the various types of correlations. We will apply these tools to large, functionally diverse superfamilies in
collaboration with researchers interested in these proteins. Using tools developed under Aim 2 and hundreds
of conserved domain datasets, Aim 4 will rigorously benchmark the performance of tools developed under
Aims 1 & 3 relative to competing methods. This project will aid r...
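
To illustrate the kind of question Aim 2 poses (whether a sequence-defined residue set and a structurally-defined residue set overlap more than chance would predict), the following is a minimal sketch using a standard hypergeometric tail probability. It is not the project's method, which the abstract does not specify; the function name `overlap_p_value`, the residue positions, and the protein length of 100 are all hypothetical.

```python
from math import comb


def overlap_p_value(n_residues, seq_set, struct_set):
    """Upper-tail hypergeometric probability of the observed overlap.

    Assumes (for illustration only) that, under the null model, the
    structure-defined set is a uniformly random subset of n_residues
    positions, independent of the sequence-defined set.
    """
    a = len(seq_set)
    b = len(struct_set)
    if a > n_residues or b > n_residues:
        raise ValueError("residue sets cannot exceed the number of residues")
    k = len(seq_set & struct_set)  # observed overlap
    total = comb(n_residues, b)
    # P(X >= k): sum over all overlaps at least as large as the observed one
    tail = sum(
        comb(a, i) * comb(n_residues - a, b - i)
        for i in range(k, min(a, b) + 1)
    )
    return tail / total


# Hypothetical residue positions in a 100-residue domain
seq_defined = {10, 12, 45, 47, 88}          # e.g. a correlated sequence pattern
struct_defined = {12, 45, 47, 90, 91, 92}   # e.g. residues lining a binding pocket

p = overlap_p_value(100, seq_defined, struct_defined)
print(f"overlap = {len(seq_defined & struct_defined)}, p = {p:.3g}")
```

A small p-value here only flags the overlap as statistically surprising under this simple null model; it does not adjust for multiple hypotheses, which the abstract identifies as a central concern.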

## Key facts

- **NIH application ID:** 9849288
- **Project number:** 5R01GM125878-03
- **Recipient organization:** UNIVERSITY OF MARYLAND BALTIMORE
- **Principal Investigator:** ANDREW F NEUWALD
- **Activity code:** R01
- **Funding institute:** NIH
- **Fiscal year:** 2020
- **Award amount:** $309,000
- **Award type:** 5
- **Project period:** 2018-02-01 → 2022-01-31

## Primary source

NIH RePORTER: https://reporter.nih.gov/project-details/9849288

## Citation

> US National Institutes of Health, RePORTER application 9849288, Advanced correlation analyses to infer sequence and structural determinants of protein function (5R01GM125878-03). Retrieved via AI Analytics 2026-05-23 from https://api.ai-analytics.org/grant/nih/9849288. Licensed CC0.

---

*[NIH grants dataset](/datasets/nih-grants) · CC0 1.0*
