# Data Analysis Tools for Emerging High-Throughput Technologies

> **NIH R35** · DANA-FARBER CANCER INST · 2021 · $596,782

## Abstract

Biomedical research and the basic sciences are increasingly dependent on high-throughput technologies that have the
ability to simultaneously measure thousands of nucleic acid molecules in a sample. In combination with ingenious
laboratory protocols, these technologies have permitted unprecedented ways of studying the molecular basis of
disease and phenotypic variation. As a result of the increasing adoption of these technologies, more investigations
rely on complex datasets and require the development of new statistical techniques to adequately interpret data.

Today, applications of high-throughput technologies go far beyond their original task of studying DNA sequence
itself and also include the measurement of quantitative and dynamic outcomes such as gene expression levels and
DNA methylation (DNAm) status. These quantitative and dynamic outcomes introduce levels of variability that
give rise to further data-analytic challenges related to distinguishing unwanted sources of variability from
biologically relevant signals. Furthermore, when measuring these quantitative outcomes, data are subject to severe
technological and biological biases that can substantially impact downstream analyses. Our group has previously
demonstrated that statistical methodology can provide great improvements over ad hoc algorithms offered as
defaults by technology developers. Our highly cited statistical methodology and our widely used software demonstrate
the success of our work.

The National Research Council's Frontiers in Massive Data Analysis publication states that “the challenges
for massive data go beyond the storage, indexing, and querying that have been the province of classical database
systems and instead hinge on the ambitious goal of inference”. Inference is particularly relevant in biomedical
applications since we often look to draw conclusions based on observed differences between groups in the presence
of within-group variability. Two particularly challenging tasks relate to performing valid inference when 1) we
perform scans over large spaces to identify small regions of interest and 2) the data are affected by unexpected
systematic bias or batch effects. We will focus on these two general challenges. Our specific proposal is to work on
the most urgent needs of researchers facing new challenges as they increasingly rely on high-throughput techniques.
We will leverage the expertise of our collaborators to prioritize projects. We greatly appreciate the flexibility
permitted by the R35 mechanism as it will help us maximize the impact of our work.

## Key facts

- **NIH application ID:** 10159937
- **Project number:** 5R35GM131802-03
- **Recipient organization:** DANA-FARBER CANCER INST
- **Principal Investigator:** Rafael Angel Irizarry
- **Activity code:** R35
- **Funding institute:** NIH
- **Fiscal year:** 2021
- **Award amount:** $596,782
- **Award type:** 5
- **Project period:** 2019-05-01 → 2024-04-30

## Primary source

NIH RePORTER: https://reporter.nih.gov/project-details/10159937

## Citation

> US National Institutes of Health, RePORTER application 10159937, Data Analysis Tools for Emerging High-Throughput Technologies (5R35GM131802-03). Retrieved via AI Analytics 2026-05-23 from https://api.ai-analytics.org/grant/nih/10159937. Licensed CC0.

---

*[NIH grants dataset](/datasets/nih-grants) · CC0 1.0*
