Development and Maintenance of RepeatMasker and RepeatModeler

NIH RePORTER · NIH · R01 · $576,412

Abstract

Project Summary

Mammalian and most other eukaryotic genomes contain a large number of interspersed repeats (IRs), most of which are copies of transposable elements (TEs) at varying levels of decay. Their presence complicates many genome sequence analyses, but accurately identifying them at an early analysis stage can reduce these complications. Beyond their pervasiveness, over recent decades the research community has become widely familiar with their enormous impact on genome activity and evolution. Every species has been exposed to a unique, complex set of TEs, leaving recognizable copies from as long ago as 300 million years to as recently as the present day. These TEs are uncovered and reconstructed by de novo discovery methods, often by our RepeatModeler tool, and their copies are then annotated by our RepeatMasker software. De novo methods can create TE libraries at a reasonable pace, but the product falls far short of the quality that can be reached by hand curation. With the recent explosive growth in sequenced species, these finishing steps, perhaps never fully automatable, now form a severe bottleneck in genome analyses due to a lack of manpower and expertise, while the results, especially when produced by different methods from different research groups, lack consistency and suffer from redundancy. Furthermore, the annotation of genomes for which high-quality libraries have been created is not keeping up with library improvements because of the computational burden of re-analysis. In this proposal, we describe a plan to refactor RepeatMasker by generalizing and improving TE alignment adjudication, switching to a family-centric search strategy with support for incremental re-analysis, improving annotation reporting, and supporting cluster environments.
Responding to the need for improved methods for automated TE library generation, we propose making significant changes to RepeatModeler’s core discovery algorithms and developing a novel model extension tool. In addition, we will extend our novel methods for exploiting multi-species alignments and ancestral reconstructions and use them to build a comprehensive mammalian TE library.

Key facts

NIH application ID: 10798299
Project number: 5R01HG002939-18
Recipient: INSTITUTE FOR SYSTEMS BIOLOGY
Principal Investigator: Robert MacDonald Hubley
Activity code: R01
Funding institute: NIH
Fiscal year: 2024
Award amount: $576,412
Award type: 5
Project period: 2022-02-04 → 2027-01-31