HMMER and Infernal: Finding distant homologs of sequences and RNA structures

NIH RePORTER · NIH · R01 · $533,450

Abstract

Genome sequence data is now available for hundreds of thousands of species. Our ability to exploit this vast trove of information about the molecular basis and evolution of life depends on sophisticated computational analysis tools. One important class of tools is profile analysis software, for making consensus statistical models of multiple alignments of biological sequence families, and for using those models to sensitively detect homologs and make deep multiple alignments. Profile analysis derives its power from the fact that despite the unbounded growth of sequence data, the majority of functional sequences can be condensed into a manageably small number of conserved families. Profile software underlies numerous protein, RNA, and DNA sequence family databases. The systematic availability of deep multiple alignments (of many thousands of sequences) is enabling revolutionary advances in predicting molecular function and 3D structure by comparative sequence analysis. The HMMER and Infernal software packages from our laboratory are some of the most widely used tools for profile analysis. HMMER implements profile hidden Markov models (profile HMMs) of primary sequence consensus, typically for protein domains and conserved DNA elements. Infernal implements profile stochastic context-free grammars (profile SCFGs) of RNA secondary structure and sequence consensus.
In the context of the continued development of these packages, this proposal has three specific aims for new lines of research that we expect to lead to major improvements in the accuracy, utility, and computational efficiency of profile analysis. The first aim proposes to develop a discontinuous Markov model of nonhomologous sequences, to improve the ability to distinguish homologs from nonhomologs and reduce the false positive rate of database searches. The second aim proposes to develop sketching methods for efficiently representing the voluminous results of a database homology search with a subset of the most phylogenetically informative hits. The third aim proposes to develop adaptive computation methods to flexibly harness the complex mix of CPU/GPU processors, memory, and storage in modern hardware architectures, enabling efficient scalable computation and near-interactive database search times.

Key facts

NIH application ID: 10487574
Project number: 5R01HG009116-06
Recipient: HARVARD UNIVERSITY
Principal Investigator: Sean R Eddy
Activity code: R01
Funding institute: NIH
Fiscal year: 2022
Award amount: $533,450
Award type: 5
Project period: 2016-09-16 → 2026-06-30