# Harnessing Rare Variants for Tumor Classification

> **NIH R01** · SLOAN-KETTERING INST CAN RESEARCH · 2022 · $396,790

## Abstract

This project concerns how to extract clinically actionable information for diagnostic purposes from mutational
patterns observed from tumor sequencing panels that are increasingly being used in routine medical care of
cancer patients. In recent years there has been intense scrutiny of the mutational landscape, using publicly
available databases such as The Cancer Genome Atlas and other important sources of information on somatic
mutations. However, most of the attention has focused on major cancer genes, and especially on the hotspots
in these genes, where mutations occur frequently. Yet the vast majority of somatic mutations
occur at “rare” genetic loci. Of the 1,788,153 distinct mutations that were observed in the 10,295 TCGA tumors,
over 92% were singletons, i.e., mutations observed in only one tumor. Moreover, when new tumors are
sequenced, on average 60% of the mutations observed were not previously observed in TCGA. To date,
investigators have mostly ignored this “hidden iceberg” of potential information. Our proposal is motivated by
the belief that at least a portion of these rare mutations contain important information that could be harnessed
for clinical purposes. In preliminary work we have adapted statistical methods that were developed for use in
analogous investigations in other scientific fields, such as species identification in ecology and language
processing, and have been able to demonstrate that the probabilities of observing rare variants in known
cancer genes differ markedly by gene, that these probabilities can be estimated accurately, and that for some
genes the probabilities exhibit strong lineage dependency. Motivated by these findings, we propose to broaden
the scope of these methods to investigate lineage dependency throughout the genome and to use the
information to develop accurate tools for classifying tumors by tissue site of origin. In Aim 1, we will integrate
data from various bioinformatic resources to characterize genes as well as mutations in non-coding parts of the
genome on the basis of their local GC content, DNA replication timing, transcriptional activity, chromatin
accessibility, and histone modification marks in the corresponding tissues-of-origin with a view to mapping
lineage-dependent variation in rare and previously unobserved variants. In Aim 2, we will use this information
to construct a classification tool based on a penalized hierarchical mixed-effects statistical model that permits
direct use of these “meta-features” for imputing the discriminatory effects of rare and previously unseen
variants. We will examine the predictive accuracy of the model using empirical validation datasets and study its
computational feasibility in the context of different data settings, e.g. panel sequencing versus whole-exome
and whole-genome. The ultimate goal is to create a tool for the classification of the anatomic site of origin of
cancers of unknown primary and of cancers detected th...
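
The abstract does not name the specific methods adapted from ecology and language processing. As an illustration only, one classic estimator from those fields is the Good-Turing estimate of "missing mass": the probability that the next observation is a previously unseen type is approximated by the fraction of all observations that are singletons. The sketch below applies that idea to toy mutation labels. The function name `good_turing_unseen_mass` and the labels are hypothetical and are not taken from the project or from TCGA.

```python
from collections import Counter


def good_turing_unseen_mass(mutation_counts):
    """Good-Turing estimate of the probability that the next observed
    mutation is one that has not been seen before.

    mutation_counts maps each distinct mutation label to the number of
    times it was observed. The estimate is N1 / N, where N1 is the number
    of mutations seen exactly once and N is the total number of observations.
    """
    total = sum(mutation_counts.values())
    if total == 0:
        raise ValueError("no observations")
    singletons = sum(1 for count in mutation_counts.values() if count == 1)
    return singletons / total


# Toy data (hypothetical labels, for illustration only).
toy = Counter([
    "KRAS_G12D", "KRAS_G12D", "KRAS_G12D",   # recurrent hotspot
    "TP53_R175H", "TP53_R175H",              # recurrent hotspot
    "GENE_A_var1", "GENE_B_var2", "GENE_C_var3",  # singletons
])

p_new = good_turing_unseen_mass(toy)
print(f"Estimated probability that the next mutation is unseen: {p_new:.3f}")
```

Computing this estimate separately per gene and per tumor lineage is one simple way to explore whether the rate of previously unseen variants varies by gene or lineage, which is the kind of dependency the abstract describes.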

## Key facts

- **NIH application ID:** 10374906
- **Project number:** 5R01CA251339-02
- **Recipient organization:** SLOAN-KETTERING INST CAN RESEARCH
- **Principal Investigator:** Colin B Begg
- **Activity code:** R01
- **Funding institute:** NIH
- **Fiscal year:** 2022
- **Award amount:** $396,790
- **Award type:** 5
- **Project period:** 2021-04-01 → 2024-03-31

## Primary source

NIH RePORTER: https://reporter.nih.gov/project-details/10374906

## Citation

> US National Institutes of Health, RePORTER application 10374906, Harnessing Rare Variants for Tumor Classification (5R01CA251339-02). Retrieved via AI Analytics 2026-05-23 from https://api.ai-analytics.org/grant/nih/10374906. Licensed CC0.

---

*[NIH grants dataset](/datasets/nih-grants) · CC0 1.0*
