# Batch effects in molecular profiling data on cancers: detection, quantitation, interpretation, and correction

> **NIH U24** · UNIVERSITY OF TX MD ANDERSON CAN CTR · 2020 · $262,991

## Abstract

Technical batch effects pose a fundamental challenge to quality control and reproducibility of even
single-laboratory research projects, but the possibilities for serious error are greatly magnified in complex,
multi-institutional enterprises such as the cancer molecular profiling projects being undertaken by the NCI
Center for Cancer Genomics (CCG). To aid in detection, quantitation, interpretation, and (when appropriate)
correction for technical batch effects in such data, we have developed the MBatch computational tool and web
portal. MBatch has become indispensable for quality-control “surveillance” of data in The Cancer Genome Atlas
(TCGA) project, but detecting and quantitating batch effects (or trend effects or statistical outliers) are just the
first steps in a process. The next steps involve detective work in collaboration with those who generated the
data, drawing upon expertise in integrative analysis across data types, pathways, and systems-level biology.
That detective work usually succeeds in diagnosing the cause of a batch effect as technical or biological. If
technical, then computational correction can be done (judiciously).

The primary aim of the proposed Genome Data Analysis Center (GDAC) is to translate that successful
quality-control model from TCGA to other current and future large-scale molecular profiling projects sponsored
by the CCG. We will be ready to do that on Day 1. The second aim is to increase the power of MBatch to
perform the basic quality-control functions. We will add a number of innovative new algorithms
(Replicates-Based Normalization, Empirical Bayes++, and CorNet) and increase the repertoire of standard methods. We
will also add major visualization resources including our interactive Next-Generation Clustered Heat Maps. The
third aim is to make the system sufficiently robust, user-friendly, interactive, carefully documented, and easy to
install that bench biologists and clinical researchers can use it to explore CCG-generated data or their own.
Toward those ends, we have established collaborations to implement MBatch in Galaxy and on the cloud.

We bring a number of assets to the proposed GDAC, including multidisciplinary expertise in bioinformatics,
biostatistics, software engineering, biology, and clinical oncology; PIs with a combined 21 years of experience
in high-throughput molecular profiling studies of clinical cancers (in a highly consortial context); international
leadership in batch effects analysis; a highly professional software engineering team with a track record of
producing high-end, highly visual bioinformatics packages and websites; a team of 20 Analysts whose
expertise can be called on; extensive computing resources, including one of the most powerful
academically-based machines in the world; strong institutional support; close working relationships with first-class basic,
translational, and clinical researchers throughout MD Anderson, one of the foremost cancer centers i...

## Key facts

- **NIH application ID:** 9999473
- **Project number:** 5U24CA210949-05
- **Recipient organization:** UNIVERSITY OF TX MD ANDERSON CAN CTR
- **Principal Investigator:** Rehan Akbani
- **Activity code:** U24
- **Funding institute:** NIH
- **Fiscal year:** 2020
- **Award amount:** $262,991
- **Award type:** 5
- **Project period:** 2016-09-13 → 2021-08-31

## Primary source

NIH RePORTER: https://reporter.nih.gov/project-details/9999473

## Citation

> US National Institutes of Health, RePORTER application 9999473, Batch effects in molecular profiling data on cancers: detection, quantitation, interpretation, and correction (5U24CA210949-05). Retrieved via AI Analytics 2026-05-22 from https://api.ai-analytics.org/grant/nih/9999473. Licensed CC0.

---

*[NIH grants dataset](/datasets/nih-grants) · CC0 1.0*
