# Advanced computational methods in analyzing high-throughput sequencing data

> **NIH R01** · DANA-FARBER CANCER INST · 2020 · $397,125

## Abstract

Sequencing technologies have become essential to the study of human evolution, to the understanding
of the genetic bases of diseases and to the clinical detection and treatment of genetic disorders. Computational
algorithms are indispensable to the analysis of large-scale sequencing data and have received broad attention.
However, many mainstream software packages for sequence alignment, assembly and variant calling were
developed several years ago and have gradually lagged behind the rapid development of sequencing technologies.
They are unable to process the latest long reads or assembled contigs, and will be outpaced by upcoming
technologies in terms of throughput. The development of advanced algorithms is critical to the applications of
sequencing technologies in the near future. This project will address this pressing need with four proposals: (1)
developing a fast and accurate aligner that accelerates short-read alignment and can map megabase-long
assemblies against large sequence collections of over 100 gigabases in size; (2) developing an integrated
caller for small sequence variations that is faster to run, more sensitive to moderately longer insertions and
more accessible to biologists without extensive expertise in bioinformatics; (3) developing a generic variant
filtering tool that uses a novel deep learning model to achieve human-level accuracy in identifying false
positive calls; (4) developing a new de novo assembler that works with the latest nanopore reads of ~100
kilobases in length and may achieve good contiguity at low coverage. Upon completion, the proposed studies
will dramatically reduce the computational cost of data processing in most research labs and commercial
entities, and will enable the applications of long reads in genome assembly, in the study of structural variations
and in cancer research.

## Key facts

- **NIH application ID:** 9870944
- **Project number:** 5R01HG010040-04
- **Recipient organization:** DANA-FARBER CANCER INST
- **Principal Investigator:** Heng Li
- **Activity code:** R01
- **Funding institute:** NIH
- **Fiscal year:** 2020
- **Award amount:** $397,125
- **Award type:** 5
- **Project period:** 2018-11-16 → 2022-02-28

## Primary source

NIH RePORTER: https://reporter.nih.gov/project-details/9870944

## Citation

> US National Institutes of Health, RePORTER application 9870944, Advanced computational methods in analyzing high-throughput sequencing data (5R01HG010040-04). Retrieved via AI Analytics 2026-05-22 from https://api.ai-analytics.org/grant/nih/9870944. Licensed CC0.

---

*[NIH grants dataset](/datasets/nih-grants) · CC0 1.0*
