Removing batch effects in high-throughput biomedical studies

NIH RePORTER · NIH · R01 · $284,190

Abstract

Combining high-throughput biomedical datasets from multiple studies is advantageous to increase statistical power in studies where logistical considerations restrict sample size or require the sequential generation of data. However, significant technical heterogeneity is commonly observed across multiple batches of data that are generated from different processing or reagent batches, experimenters, protocols, or profiling platforms. These so-called batch effects confound true relationships in the data, reducing the power benefits of combining multiple batches of data, and may even lead to spurious results. Many methods have been proposed to filter technical heterogeneity from genomic data. These methods are designed to remove batch effects, unmeasured or “surrogate” variation, or other “unwanted” variation caused by biological or technical sources. Although these approaches represent impactful advances in the field, there are still significant gaps that need to be addressed to appropriately filter technical heterogeneity from -omics data and other high-throughput datasets. For example, many existing methods assume relevant covariates are known or that raw data are generally independent. Some applications require more specific and direct correction methods, including single-cell transcriptomics data that are often missing cell-type identifiers, microbiome data that are compositional in nature, and imaging and spatial transcriptomics data that have spatially correlated data points. Furthermore, batch correction introduces correlation into the adjusted data, which needs to be accounted for in downstream analyses. Most researchers performing batch correction are unaware of this negative impact and often apply downstream analysis tools incorrectly. Finally, there is still a significant need for additional software tools and benchmark datasets for evaluating batch effect correction methods and their efficacy in specific datasets.
We propose to develop algorithms and software to address these specific research gaps facing researchers combining data from multiple experimental batches.

Key facts

NIH application ID: 10935948
Project number: 5R01GM127430-07
Recipient: RUTGERS BIOMEDICAL AND HEALTH SCIENCES
Principal Investigator: William Evan Johnson
Activity code: R01
Funding institute: NIH
Fiscal year: 2024
Award amount: $284,190
Award type: 5
Project period: 2018-05-01 → 2027-08-31