# Long read based sequencing software for the comprehensive analysis of clinical samples

> **NIH R44** · DNASTAR, INC. · 2020 · $750,000

## Abstract

The high cost and complexity of the analysis of whole genome resequencing remain prohibitive for most clinical
applications. Targeted resequencing allows regions of interest to be enriched from a genomic DNA sample and
sequenced to high depth, allowing cost-effective identification of important variants. In combination with
next-generation sequencing (NGS), the approach has been exploited to tremendous effect in identifying candidate
genes and variants for an array of diseases and traits from cohorts and populations as well as individual clinical
samples. However, the short read nature of NGS technologies severely limits its potential to characterize, for
example, compound heterozygotes due to the lack of long range connectivity needed for haplotype phasing and
structural variants (SV). Those limitations can be overcome with long read data from Pacific Biosciences (PacBio)
or Oxford Nanopore Technologies (ONT). Moreover, new targeting methods tailored toward long read
sequencing are being developed such that a comprehensive analysis of key regions in an individual’s genome
will soon be within reach. However, an integrated software solution that is easy enough for clinical researchers
to efficiently use is sorely lacking.

The overall goal of this Direct to Phase II proposal is to develop commercial-grade software that produces
a comprehensive catalog of annotated haplotype phased variants from clinical sequencing data and presents
them to clinical researchers through a single easy-to-use application with both analytical and genome browsing
capabilities, GenVision Ultra. The proposal focuses on augmenting our highly extensible XNG assembly pipeline
with tools necessary for fully automated detection and annotation of all classes of variants from haplotype
phased sequences. Novel adaptions to core XNG components will partition reads matching the reference from
those likely representing an SV for parallel processing (Aim 1). Matching reads will be aligned to the reference
using XNG while the putative SV-containing reads will be de novo assembled and annotated using our long read
assembler (LRA). Reference-based alignments will be phased using a novel Bayesian classifier to produce two
haplotype sequences prior to SNV/small indel calling and annotation (Aim 2). Short read polishing of the entire
assembly will be available on demand. Complete small variant and SV profiles as well as the underlying
assembly data will be accessible to the end user in GenVision Ultra. In addition, the application will have discrete
filtering and statistical tools with which to identify genes and/or variants of interest in an individual sample or
across a cohort/population (Aim 3). To ensure that the software meets the clinical sequencing market needs,
Arkana Laboratories has agreed to provide ONT and Illumina sequence data from highly curated HapMap control
samples processed with their kidney disease gene panels. Those real-world data sets together with expert
interpretati...

## Key facts

- **NIH application ID:** 10009727
- **Project number:** 1R44GM137643-01
- **Recipient organization:** DNASTAR, INC.
- **Principal Investigator:** TIMOTHY J DURFEE
- **Activity code:** R44
- **Funding institute:** NIH
- **Fiscal year:** 2020
- **Award amount:** $750,000
- **Award type:** 1
- **Project period:** 2020-04-01 → 2022-03-31

## Primary source

NIH RePORTER: https://reporter.nih.gov/project-details/10009727

## Citation

> US National Institutes of Health, RePORTER application 10009727, Long read based sequencing software for the comprehensive analysis of clinical samples (1R44GM137643-01). Retrieved via AI Analytics 2026-05-22 from https://api.ai-analytics.org/grant/nih/10009727. Licensed CC0.

---

*[NIH grants dataset](/datasets/nih-grants) · CC0 1.0*