# Exploring the unknown protein universe using evolutionary information

> **NIH DP5** · HARVARD UNIVERSITY · 2020 · $422,500

## Abstract

 For billions of years, nature has been conducting the greatest experiment of all time.
Imagine one day gaining access to the detailed notes from this experiment. Today, with
worldwide expeditions to collect samples from all habitats, single-cell sequencing of
unculturable microbes and the rapid drop in sequencing costs, we can finally tap into nature and
gain access to these notes. All that is missing is a Rosetta Stone to interpret this data.
 The traditional approach to interpreting sequence data is comparison to known
information, such as annotated genomes or experimentally characterized protein families.
Unfortunately, nearly half of metagenomic data (from either environmental samples or
microbiomes) lacks detectable sequence homology to any protein family, let alone to any
isolated genome. Furthermore, the rate at which this “dark matter” is discovered far exceeds
the rate at which experiments can be done to characterize it.
 An alternative approach is to learn a generative statistical model of the evolutionary
process itself. The parameters of this model should in turn reveal the constraints imposed by
natural selection. For protein-coding genes, these constraints include folding, stability, and
function. Recently, it was shown that a global statistical model of a protein family that captures
both conservation and coevolution patterns in the family possesses this quality. The strength of
the coevolution term is correlated with residue-residue contacts in the 3D structure. These contacts
have since been used to computationally determine the 3D structures of hundreds of unknown
protein families and complexes. These structures, in turn, have been used to predict function by
examining the arrangement of conserved residues and structural similarity to known protein structures.
Structural matches can occur in the absence of detectable sequence similarity because
structural similarity is retained over larger evolutionary distances.
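
The link between coevolution and residue contacts can be illustrated with a small sketch. This is not the proposal's model: the abstract refers to a global statistical model fit to a whole protein family, whereas the code below swaps in a much simpler pairwise statistic, mutual information with an average product correction (APC). The toy alignment `TOY_MSA` and the function names are hypothetical and exist only for illustration.

```python
import math
from collections import Counter
from itertools import combinations

# Hypothetical toy alignment (rows = sequences, columns = positions).
# Column 2 is fully conserved; columns 1 and 3 covary (K with E, R with D).
TOY_MSA = [
    "AKLE",
    "AKLE",
    "ARLD",
    "ARLD",
    "GKLE",
    "GRLD",
]


def mutual_information(msa, i, j):
    """Mutual information (in nats) between alignment columns i and j."""
    n = len(msa)
    freq_i = Counter(seq[i] for seq in msa)
    freq_j = Counter(seq[j] for seq in msa)
    freq_ij = Counter((seq[i], seq[j]) for seq in msa)
    mi = 0.0
    for (a, b), count in freq_ij.items():
        p_ab = count / n
        mi += p_ab * math.log(p_ab / ((freq_i[a] / n) * (freq_j[b] / n)))
    return mi


def apc_corrected_scores(msa):
    """Return {(i, j): MI minus the average product correction} for all column pairs."""
    length = len(msa[0])
    raw = {pair: mutual_information(msa, *pair)
           for pair in combinations(range(length), 2)}

    # Mean raw score of each column against all other columns.
    col_mean = [0.0] * length
    for (i, j), value in raw.items():
        col_mean[i] += value
        col_mean[j] += value
    col_mean = [total / (length - 1) for total in col_mean]

    overall_mean = sum(raw.values()) / len(raw)
    if overall_mean == 0.0:
        return dict(raw)  # no signal at all, so there is nothing to correct

    return {(i, j): value - col_mean[i] * col_mean[j] / overall_mean
            for (i, j), value in raw.items()}


scores = apc_corrected_scores(TOY_MSA)
top_pair = max(scores, key=scores.get)
print(top_pair)
```

In this toy alignment the covarying columns 1 and 3 receive the highest score, while the conserved column 2 contributes no coevolution signal. On real alignments, a pairwise statistic like this confuses direct and indirect couplings, which is one motivation for the global models the abstract describes.
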
 I propose to 1) develop an improved, unified statistical model of protein evolution that
takes into account functional and lineage constraints; 2) apply the model to mine metagenomic
“dark matter” sequences for new protein families, functions, and protein-protein interactions; and 3)
probe the evolution of multicellularity through comparison of structures and interactions in the early
tree of life. One result of the research will be a public database of new protein families
and their predicted 3D structures and functions. These will serve structural, molecular, and
evolutionary biologists as a reference for future studies of the unknown protein universe.

## Key facts

- **NIH application ID:** 9990892
- **Project number:** 5DP5OD026389-03
- **Recipient organization:** HARVARD UNIVERSITY
- **Principal Investigator:** Sergey L. Ovchinnikov
- **Activity code:** DP5
- **Funding institute:** NIH
- **Fiscal year:** 2020
- **Award amount:** $422,500
- **Award type:** 5
- **Project period:** 2018-09-07 → 2023-08-31

## Primary source

NIH RePORTER: https://reporter.nih.gov/project-details/9990892

## Citation

> US National Institutes of Health, RePORTER application 9990892, Exploring the unknown protein universe using evolutionary information (5DP5OD026389-03). Retrieved via AI Analytics 2026-05-22 from https://api.ai-analytics.org/grant/nih/9990892. Licensed CC0.

---

*[NIH grants dataset](/datasets/nih-grants) · CC0 1.0*