# A statistical framework to systematically characterize cancer driver mutations in noncoding genomic regions

> **NIH R21** · DANA-FARBER CANCER INST · 2020 · $177,000

## Abstract

Cancer genomes typically harbor a substantial number of somatic mutations. Relatively few driver mutations
actually alter the function of proteins in tumor cells, whereas most mutations are considered to be functionally
neutral passenger mutations. Over the past decade, the search for cancer driver mutations has focused on
coding regions and several mutational significance algorithms have been developed for coding mutations. The
contribution of mutations in noncoding regulatory regions to tumor formation largely remains unknown and
current mutational significance algorithms are not designed to detect driver mutations in noncoding regions, due
to biological differences between coding and noncoding mutations. The growing availability of large whole-genome
sequencing datasets (e.g., PCAWG and HMF) creates an opportunity to develop new
mutational significance algorithms designed specifically for the interpretation of noncoding regions.
Recently, we have developed a new statistical approach that identifies driver mutations in coding regions based
on the nucleotide context. Critically, consideration of the nucleotide context around mutations does not require
prior knowledge of the functional consequences associated with these mutations. Hence, we hypothesize that
generalizing our nucleotide context model to noncoding regions will uncover novel noncoding driver
mutations that cannot be detected using the mutational significance approaches currently available. For
this purpose, we will develop a statistical framework that incorporates the biological differences between coding
and noncoding mutations and that is specifically designed to detect driver mutations in noncoding regions.
Specifically, we will model the context-dependent distribution of passenger mutations, model and
accurately partition the background mutation rate, model the sequence composition
of the reference genome, and account for coverage fluctuation. We will then combine these statistical
components by taking the product of their underlying probabilities, treating them as independent. We will derive a significance
p-value using a Monte Carlo simulation approach and control the false discovery rate (FDR) to correct for multiple hypothesis testing. This
strategy will allow us to accurately estimate the significance of somatic mutations in noncoding genomic regions.
We will next apply this statistical framework to whole-genome sequencing data of 5,523 tumor patients, thereby
deriving a comprehensive list of candidate driver mutations in noncoding regions. Finally, we will investigate
whether noncoding mutations are overrepresented in transcription factor binding sites, regulate gene expression
levels, induce alternative splicing, or affect epigenomic states. Upon the completion of this project, we will have
developed and applied a statistical framework for discovery of significant somatic mutations in noncoding
regions, and defined the mutational landscape of the no...
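
The significance-testing steps named in the abstract (a product of independent probabilities, a Monte Carlo p-value, and FDR correction) can be illustrated with a minimal sketch. This is not the project's actual method: the three-category background distribution, the function names, and the choice of the Benjamini-Hochberg procedure for FDR control are all assumptions made for illustration, and the real model's components (coverage, sequence composition, rate partitioning) are omitted.

```python
import math
import random


def log_score(contexts, context_prob):
    """Negative log of the product of per-mutation background probabilities.

    Treating mutations as independent turns the product into a sum of logs.
    A higher score means the mutations are less likely under the background.
    """
    return -sum(math.log(context_prob[c]) for c in contexts)


def monte_carlo_pvalue(observed, context_prob, n_sim=2000, seed=0):
    """Empirical p-value: share of simulated regions scoring >= the observed one."""
    rng = random.Random(seed)
    labels = list(context_prob)
    weights = [context_prob[c] for c in labels]
    obs = log_score(observed, context_prob)
    hits = 0
    for _ in range(n_sim):
        # Draw the same number of mutations from the background distribution.
        sim = rng.choices(labels, weights=weights, k=len(observed))
        if log_score(sim, context_prob) >= obs:
            hits += 1
    # Add-one correction keeps the estimate strictly above zero.
    return (hits + 1) / (n_sim + 1)


def benjamini_hochberg(pvalues):
    """Benjamini-Hochberg adjusted p-values (q-values), in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    qvalues = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotone q-values.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        qvalues[i] = running_min
    return qvalues


# Hypothetical background: probability that a passenger mutation falls in
# each of three made-up nucleotide contexts.
background = {"ctx_common": 0.70, "ctx_medium": 0.25, "ctx_rare": 0.05}

# Hypothetical regions, each listed as the contexts of its observed mutations.
regions = {
    "region_1": ["ctx_rare"] * 5,
    "region_2": ["ctx_common"] * 5,
    "region_3": ["ctx_common", "ctx_medium", "ctx_common", "ctx_rare", "ctx_common"],
}

p_values = [monte_carlo_pvalue(muts, background) for muts in regions.values()]
q_values = benjamini_hochberg(p_values)
```

In this toy setup, a region whose mutations all fall in the rare context receives a small p-value, while a region whose mutations all fall in the common context receives a p-value of 1, since every simulated region scores at least as high.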

## Key facts

- **NIH application ID:** 10260680
- **Project number:** 3R21CA242861-02S1
- **Recipient organization:** DANA-FARBER CANCER INST
- **Principal Investigator:** Eliezer M Van Allen
- **Activity code:** R21
- **Funding institute:** NIH
- **Fiscal year:** 2020
- **Award amount:** $177,000
- **Award type:** 3
- **Project period:** 2019-07-01 → 2021-12-31

## Primary source

NIH RePORTER: https://reporter.nih.gov/project-details/10260680

## Citation

> US National Institutes of Health, RePORTER application 10260680, A statistical framework to systematically characterize cancer driver mutations in noncoding genomic regions (3R21CA242861-02S1). Retrieved via AI Analytics 2026-05-23 from https://api.ai-analytics.org/grant/nih/10260680. Licensed CC0.

---

*[NIH grants dataset](/datasets/nih-grants) · CC0 1.0*
