Dfam: sustainable growth, curation support, and improved quality for mobile element annotation

NIH RePORTER · NIH · U24 · $529,832

Abstract

Repetitive DNA, especially that due to transposable elements (TEs), makes up a large fraction of many genomes. Thorough and accurate annotation of repetitive content in genomes depends on a comprehensive database of known TEs, along with robust statistical and procedural methods for recognizing decayed instances of elements and disentangling their complex relationships. Annotation of TE instances is usually performed using our RepeatMasker software, which compares a genome to a database containing representations of known repeat families. These have historically been consensus sequences, which generally approximate the sequences of the original TEs. Our Dfam database is an open access collection of repetitive DNA families, in which each family is represented by a multiple sequence alignment and a profile hidden Markov model (HMM). We have demonstrated that profile HMMs support improved annotation sensitivity, and Dfam provides numerous aids to both curators of TE families and those who make use of the resulting annotations. During the life of this grant, the database has grown to include families belonging to more than 1000 species (from a baseline of 5). This growth has introduced a number of scale-based pressures, which in some cases have forced us to reduce Dfam functionality in response, and in other cases highlighted ways that the resource can better meet the needs of the community. Our proposed efforts largely target these matters while continuing to expand and diversify the resource.

Key facts

NIH application ID: 10929987
Project number: 5U24HG010136-07
Recipient: INSTITUTE FOR SYSTEMS BIOLOGY
Principal Investigator: Robert MacDonald Hubley
Activity code: U24
Funding institute: NIH
Fiscal year: 2024
Award amount: $529,832
Award type: 5
Project period: 2018-08-15 → 2028-06-30