# Machine learning approaches for improved accuracy and speed in sequence annotation

> **NIH R01** · UNIVERSITY OF MONTANA · 2020 · $287,504

## Abstract

Alignment of biological sequences is a key step in understanding their evolution, function, and patterns of
activity. Here, we describe Machine Learning approaches to improve both accuracy and speed of highly
sensitive sequence alignment. To improve accuracy, we develop methods to reduce erroneous annotation
caused by (1) the existence of low complexity and repetitive sequence and (2) the overextension of
alignments of true homologs into unrelated sequence. We describe approaches based on both hidden
Markov models and Artificial Neural Networks to dramatically reduce these sorts of sequence annotation
errors. We also address the issue of annotation speed by developing a custom Deep Learning
architecture designed to very quickly filter away large portions of candidate sequence comparisons prior to
the relatively slow sequence alignment step. The results of these efforts will be incorporated into forks of the
open source sequence alignment tools HMMER, MMSeqs, and (where appropriate) BLAST; we will also
work with community developers of annotation pipelines, such as RepeatMasker and IMG/M, to incorporate
these approaches. Developing these methods and incorporating them into these widely used bioinformatics
tools will lead to widespread impact on sequence annotation efforts.

## Key facts

- **NIH application ID:** 10020995
- **Project number:** 5R01GM132600-02
- **Recipient organization:** UNIVERSITY OF MONTANA
- **Principal Investigator:** Travis John Wheeler
- **Activity code:** R01
- **Funding institute:** NIH
- **Fiscal year:** 2020
- **Award amount:** $287,504
- **Award type:** 5
- **Project period:** 2019-09-20 → 2023-07-31

## Primary source

NIH RePORTER: https://reporter.nih.gov/project-details/10020995

## Citation

> US National Institutes of Health, RePORTER application 10020995, Machine learning approaches for improved accuracy and speed in sequence annotation (5R01GM132600-02). Retrieved via AI Analytics 2026-05-23 from https://api.ai-analytics.org/grant/nih/10020995. Licensed CC0.

---

*[NIH grants dataset](/datasets/nih-grants) · CC0 1.0*
