Protein Sequence Matching

NIH RePORTER · NIH · R01 · $307,491

Abstract

Combining the vast body of protein sequence data within the framework of protein structures enables a deeper understanding of the complex effects of amino acid substitutions. Compiling the sequence correlations within protein structural domains will make it easier to distinguish neutral from deleterious changes. Protein structures provide the framework for understanding the sequence data, through the physical proximity of directly interacting amino acids and through the manifestation of allostery. This will transform sequence matching from a 1-D process into a 3-D process. Owing to rapid advances in sequencing, the large number of available genomes now provides hundreds of millions of protein sequences, and similar advances in structural biology now provide 100,000+ protein structures. Combining these data, our preliminary results show that accounting for pairwise sequence correlations between residues that interact closely in the protein structures immediately improves the ability to identify similar structures by sequence matching. Other preliminary data show that function identification by sequence matching is also improved. Such improved homolog identification can lead to progress in structure prediction. The overarching goal is to apply a deep knowledge of protein structure, together with analyses of the available sequence data, to the important problem of protein sequence matching. We take an entirely new, highly innovative, and uniquely multi-faceted approach to this problem. It is well established that physical factors, such as dense packing of amino acids and other physical aspects of structures, affect the conservation of amino acids, and these are accounted for in the new approaches to sequence matching taken here.
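As an illustration only, the idea of scoring pairwise sequence correlations for residue pairs that are in contact in a structure can be sketched in a few lines of Python. The abstract does not say which correlation measure the project uses; mutual information between alignment columns is assumed here as a common stand-in. The toy alignment, the contact list, and all function names are hypothetical.

```python
from collections import Counter
from math import log2


def column(msa, i):
    """Return the residues found at position i across all aligned sequences."""
    return [seq[i] for seq in msa]


def mutual_information(col_a, col_b):
    """Mutual information (in bits) between two alignment columns.

    Assumed measure: the source abstract does not name a specific one.
    """
    n = len(col_a)
    count_a = Counter(col_a)
    count_b = Counter(col_b)
    count_ab = Counter(zip(col_a, col_b))
    mi = 0.0
    for (a, b), c in count_ab.items():
        p_ab = c / n
        p_a = count_a[a] / n
        p_b = count_b[b] / n
        mi += p_ab * log2(p_ab / (p_a * p_b))
    return mi


def contact_pair_scores(msa, contacts):
    """Score only the position pairs listed as structural contacts."""
    return {
        (i, j): mutual_information(column(msa, i), column(msa, j))
        for i, j in contacts
    }


# Hypothetical toy data: four aligned sequences and two assumed contacts.
toy_msa = ["AKLE", "AKLD", "GRLE", "GRLD"]
toy_contacts = [(0, 1), (0, 3)]

scores = contact_pair_scores(toy_msa, toy_contacts)
for pair, score in sorted(scores.items()):
    print(pair, round(score, 3))
```

In this toy alignment, positions 0 and 1 always change together, so their score is high, while positions 0 and 3 vary independently and score zero. A real analysis would need far larger alignments, contacts derived from actual structures, and corrections for sampling and phylogenetic bias.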
The rationale is that protein structures provide the physical information and the framework needed to incorporate aspects of 3-D structure and allostery into sequence matching. Accounting for protein flexibility and conformational dynamics will further broaden the conformational space investigated and provide a better understanding of the correlations important for sequence evolution. Results from this project will improve the practice of molecular biology, particularly the identification of functions for proteins with no assigned function, and this is certain to have a major impact on the understanding of evolution. This project will apply innovative new methods for extracting correlations in sequence, structure, and dynamics by data mining of sequences and structures. The novel structure-based approaches will enable major advances in sequence matching, which will be implemented and disseminated on new web servers available to anyone. The outcomes of the project will enable any scientist to discriminate significantly more effectively between similar and dissimilar sequences. This better discrimination is essential for better function predic...

Key facts

NIH application ID: 9851415
Project number: 5R01GM127701-03
Recipient: IOWA STATE UNIVERSITY
Principal Investigator: ROBERT L JERNIGAN
Activity code: R01
Funding institute: NIH
Fiscal year: 2020
Award amount: $307,491
Award type: 5
Project period: 2018-02-01 → 2022-01-31