Machine learning approaches for improved accuracy and speed in sequence annotation

NIH RePORTER · NIH · R01 · $51,699

Abstract

Alignment of biological sequences is a key step in understanding their evolution, function, and patterns of activity. Here, we describe machine learning approaches to improve both the accuracy and the speed of highly sensitive sequence alignment. To improve accuracy, we develop methods to reduce erroneous annotation caused by (1) the existence of low-complexity and repetitive sequence and (2) the overextension of alignments of true homologs into unrelated sequence. We describe approaches based on both hidden Markov models and artificial neural networks to dramatically reduce these sorts of sequence annotation errors. We also address the issue of annotation speed by developing a custom deep learning architecture designed to very quickly filter away large portions of candidate sequence comparisons prior to the relatively slow sequence alignment step. The results of these efforts will be incorporated into forks of the open-source sequence alignment tools HMMER, MMSeqs, and (where appropriate) BLAST; we will also work with community developers of annotation pipelines, such as RepeatMasker and IMG/M, to incorporate these approaches. Developing these methods and incorporating them into widely used bioinformatics tools will lead to widespread impact on sequence annotation efforts.

Key facts

NIH application ID: 10465048
Project number: 5R01GM132600-04
Recipient: UNIVERSITY OF MONTANA
Principal Investigator: Travis John Wheeler
Activity code: R01
Funding institute: NIH
Fiscal year: 2022
Award amount: $51,699
Award type: 5
Project period: 2019-09-20 → 2022-09-23