Removing batch effects in genomic and epigenomic studies

NIH RePORTER · NIH · R01 · $325,875

Abstract

Combining genomic data sets from multiple studies increases statistical power in studies where logistical considerations restrict sample size or require the sequential generation of data. However, significant technical heterogeneity is commonly observed across data generated in different batches, experiments, or profiling platforms. These so-called batch effects often confound true biological relationships in the data, reducing the power benefits of combining multiple batches of data, and may even lead to spurious results. Many methods have been proposed to filter technical heterogeneity and batch effects from genomic data. However, significant gaps remain before technical heterogeneity can be filtered appropriately from genomic datasets. For example, existing approaches assume bell-shaped, symmetric data, an assumption that is not appropriate for modern sequencing count data. Furthermore, there are currently no batch effect approaches for genomic data that measure features at a fine-grained level, for example epigenetic sequencing data, where nearby features are likely to be closely correlated. Current batch adjustment methods are dependent on the data batches on hand, meaning that if additional batches of data were added to the analysis, the batch adjustments would need to be reapplied, resulting in different adjusted genomic data values. In addition, batch correction usually introduces correlation into the adjusted data, which needs to be accounted for in downstream analyses; most researchers performing batch correction before additional analysis steps are unaware of this negative impact, and as a result often apply downstream analysis tools incorrectly. Finally, it is not always clear which batch adjustment method should be applied in a particular case, so a thorough evaluation is required before an appropriate batch correction strategy can be devised.
These gaps highlight the need for new statistical methods and interactive visualization software to meet the needs of researchers in this area. We propose to develop algorithms and software to address these specific gaps for researchers combining data from multiple experimental batches.
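
The abstract's point that batch adjustments depend on the batches on hand can be illustrated with a minimal sketch. It assumes a deliberately simplified location-scale adjustment (each batch's per-feature mean and standard deviation are aligned to pooled values); this is a stand-in for illustration, not the method proposed in this project or any published tool. The function name `adjust_batches` and the simulated data are hypothetical.

```python
import numpy as np


def adjust_batches(batches):
    """Align every batch's per-feature mean and SD to pooled targets.

    Hypothetical, simplified location-scale adjustment: each batch is a
    (samples x features) array. The targets are computed from ALL batches
    passed in, which is what makes the result depend on the batches on hand.
    """
    pooled = np.vstack(batches)
    grand_mean = pooled.mean(axis=0)
    # Target SD: average of the per-batch sample SDs for each feature.
    target_sd = np.mean([b.std(axis=0, ddof=1) for b in batches], axis=0)

    adjusted = []
    for b in batches:
        # Standardize within the batch, then rescale to the pooled targets.
        z = (b - b.mean(axis=0)) / b.std(axis=0, ddof=1)
        adjusted.append(z * target_sd + grand_mean)
    return adjusted


# Simulated data (assumption): three batches with different shifts and scales.
rng = np.random.default_rng(0)
batch_a = rng.normal(loc=0.0, scale=1.0, size=(20, 5))
batch_b = rng.normal(loc=2.0, scale=1.5, size=(20, 5))
batch_c = rng.normal(loc=-1.0, scale=0.5, size=(20, 5))

two = adjust_batches([batch_a, batch_b])
three = adjust_batches([batch_a, batch_b, batch_c])

# Batch A's adjusted values differ once batch C joins the analysis.
max_shift = np.abs(two[0] - three[0]).max()
print(f"largest change in batch A's adjusted values: {max_shift:.3f}")
```

In this sketch, adding the third batch moves the pooled mean and target SD, so the adjusted values for the first batch change even though its raw data did not.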

Key facts

NIH application ID: 9926913
Project number: 5R01GM127430-03
Recipient: BOSTON UNIVERSITY MEDICAL CAMPUS
Principal Investigator: William Evan Johnson
Activity code: R01
Funding institute: NIH
Fiscal year: 2020
Award amount: $325,875
Award type: 5
Project period: 2018-05-01 → 2022-04-30