# Improved genomic sketching for MUMmer and metagenomics

> **NIH R01** · CARNEGIE-MELLON UNIVERSITY · 2024 · $409,571

## Abstract

Increasing the efficiency of computational methods has been instrumental to extracting insight from genomic data. Fast aligners such as MUMmer, fast k-mer counters such as Jellyfish, fast expression quantifiers such as Sailfish and Salmon, and high-quality efficient genome assemblers such as MaSuRCA have been crucial to unlocking the potential of genomic and metagenomic data. Nevertheless, computation remains a time and cost bottleneck in many application areas. Algorithmic sketching methods, such as the minimizer schemes, have been a useful technique for achieving improved computational efficiency. However, despite their importance, these sketching techniques are understudied from a theoretical perspective and underused from a practical perspective.

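
For readers unfamiliar with minimizer schemes, the following is a minimal sketch of the classic window-minimizer idea: from every window of `w` consecutive k-mers, keep the smallest one. It is an illustration only and not the project's proposed marker selection schemes. The function name `minimizers`, the plain lexicographic ordering, and the leftmost tie-break are assumptions made for this example; practical tools typically order k-mers by a hash value instead.

```python
def minimizers(seq: str, k: int, w: int) -> list[tuple[int, str]]:
    """Return (position, k-mer) pairs selected as window minimizers.

    Illustrative assumptions: k-mers are compared lexicographically and
    ties are broken by taking the leftmost occurrence in the window.
    """
    if k <= 0 or w <= 0:
        raise ValueError("k and w must be positive")

    # All k-mers of the sequence, indexed by start position.
    kmers = [seq[i:i + k] for i in range(len(seq) - k + 1)]

    selected: list[tuple[int, str]] = []
    # Slide a window covering w consecutive k-mers.
    for start in range(len(kmers) - w + 1):
        # Smallest k-mer in the window; leftmost position on ties.
        pos = min(range(start, start + w), key=lambda i: (kmers[i], i))
        # Adjacent windows often pick the same position; record it once.
        if not selected or selected[-1][0] != pos:
            selected.append((pos, kmers[pos]))
    return selected
```

Calling `minimizers("ACGTACGT", k=3, w=3)` keeps only a subset of the six k-mers in the sequence, which is the source of the space and time savings: two sequences that share a long enough substring are guaranteed to select the same minimizers inside it.
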
We propose to design, implement, test, and validate new sketching approaches based on significant extensions to the successful minimizer sketching schemes, greatly increasing the flexibility of these approaches and expanding their use into new areas including handling high-variance or highly repetitive sequences, and providing a new, standard sketching toolkit for genomic method designers and software implementors. These extensions, collectively referred to as marker selection schemes, will enable faster alignment, clustering, and assembly of genomic sequences, and will spur further computational innovation in genomic applications.

To inform and validate this algorithmic work, we propose to enhance three important and broad areas of genomic computational methods. First, we will extend the widely used MUMmer aligner with a number of application-specific “modes” that exploit these new and existing sketching schemes to achieve enhanced efficiency and greater sensitivity. This will ensure continued development and enhancement for additional applications of this important computational tool. Second, we will enhance the MaSuRCA genome assembler with updated integration with the new MUMmer. Third, we will use the developed marker selection schemes and additional algorithmic ideas based on geometric embedding of sequences to develop more accurate, fast estimators of distances between genomic sequences. These approximate distance estimators are essential for a number of metagenomic applications including species classification, clustering, and search. We will advance the computational accuracy of these tasks through these improved estimators.

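
As background on what an approximate distance estimator can look like, here is a minimal sketch of a bottom-s MinHash estimate of Jaccard similarity between the k-mer sets of two sequences. This is a standard, well-known technique swapped in for illustration; it is not the estimator the project proposes, and the names `bottom_sketch` and `jaccard_estimate`, the choice of BLAKE2b as the hash, and the parameters `k` and `s` are all assumptions of this example.

```python
import hashlib


def _hash_kmer(kmer: str) -> int:
    """Deterministic 64-bit hash of a k-mer (BLAKE2b is an arbitrary choice)."""
    digest = hashlib.blake2b(kmer.encode("ascii"), digest_size=8).digest()
    return int.from_bytes(digest, "big")


def bottom_sketch(seq: str, k: int, s: int) -> list[int]:
    """Return the s smallest hash values over the distinct k-mers of seq."""
    kmers = {seq[i:i + k] for i in range(len(seq) - k + 1)}
    return sorted({_hash_kmer(x) for x in kmers})[:s]


def jaccard_estimate(sketch_a: list[int], sketch_b: list[int], s: int) -> float:
    """Estimate Jaccard similarity from two bottom-s sketches.

    Take the s smallest hashes of the merged sketches and report the
    fraction of them that appear in both inputs.
    """
    merged = sorted(set(sketch_a) | set(sketch_b))[:s]
    if not merged:
        return 0.0
    shared = set(sketch_a) & set(sketch_b)
    return sum(1 for h in merged if h in shared) / len(merged)


# Usage: compare two short sequences with k = 3 and sketch size s = 100.
a = bottom_sketch("ACGTACGTAC", k=3, s=100)
b = bottom_sketch("ACGTTTTTAC", k=3, s=100)
similarity = jaccard_estimate(a, b, s=100)
distance = 1.0 - similarity  # a simple Jaccard distance
```

When `s` is at least as large as the number of distinct k-mers, the estimate coincides with the exact Jaccard similarity; smaller sketches trade accuracy for speed and memory, which is the trade-off improved estimators aim to tighten.
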
This project will result in a deeper toolbox of genomic sketching and distance estimation algorithms, software libraries encoding these new algorithms for wider use by the community, and an improved suite of genomic software, including enhancements to a widely used aligner and assembler and improved accuracy in existing and new metagenomic software.

## Key facts

- **NIH application ID:** 10850782
- **Project number:** 5R01HG012470-03
- **Recipient organization:** CARNEGIE-MELLON UNIVERSITY
- **Principal Investigator:** Carleton Lee Kingsford
- **Activity code:** R01
- **Funding institute:** NIH
- **Fiscal year:** 2024
- **Award amount:** $409,571
- **Award type:** 5
- **Project period:** 2022-07-22 → 2026-05-31

## Primary source

NIH RePORTER: https://reporter.nih.gov/project-details/10850782

## Citation

> US National Institutes of Health, RePORTER application 10850782, Improved genomic sketching for MUMmer and metagenomics (5R01HG012470-03). Retrieved via AI Analytics 2026-05-26 from https://api.ai-analytics.org/grant/nih/10850782. Licensed CC0.

---

*[NIH grants dataset](/datasets/nih-grants) · CC0 1.0*
