# DATA CORE

> **NIH U19** · HARVARD UNIVERSITY · 2024 · $203,065

## Abstract

Achieving the scientific goals of the Overall Research Strategy requires a significant effort and advancement in
data science for neuroscience. In particular, scientific progress depends on novel experimental design, data
collection and processing (as described in Projects 1, 2, and 3), and novel analysis and models (as described in
Projects 1, 2, and 3), which lead to general principles to be tested (as described in Projects 2 and 3). The
fundamental goal of the Data Science Core is to accelerate the process connecting the raw data collected in all
three Projects to the analyses used to obtain data derivatives, which can then be used to build models across
all three Projects, and validated via electron microscopy (EM) in Project 3. The two main challenges we face in
accelerating these links are big data and reproducibility. First, the data collected are too large to fit into memory,
or even on disk, with each experiment on the order of one terabyte (TB), and the entire dataset amassing hundreds
of TB or more. Therefore, the classic paradigm of running all analyses in MATLAB on locally stored data is not
sufficient. The solution to this is twofold: 1) build scalable algorithms, so that different individuals can apply them
to these big data, and 2) develop cloud data management systems, so that all consortium members can quickly
access and analyze the data, and then integrate them with one another. The cloud data management system
will be built on the infrastructure developed for the Open Connectome Project1, originally developed to host data
on institutional resources, and ZBrain 2.0, a resource we are developing to define a common coordinate space
for zebrafish brain atlasing. Second, this is a team effort, so sharing analyses and derivatives and keeping track
of metadata will be important. The solution to this is threefold: 1) build a comprehensive scientific environment
in the cloud that enables sharing of entire “digital experiments”, linking to the data and ensuring that the entire
analysis pipeline can be trivially run and extended by anyone, anywhere, 2) carefully curate data and
metadata in existing resources, and 3) facilitate the integration of different imaging datasets to improve ZBrain
2.0. Our entire system is built on open-source tools and will remain open source, portable, and reproducible, and will use
and extend best practices of data science and FAIR (Findable, Accessible, Interoperable, and Re-usable)2 data
management. Completing all the aims in this Data Science Core will not only enable and accelerate the scientific
progress addressed by this proposal but will also establish new standards in data science that can be immediately
applied to all other U19 efforts, as well as many other efforts within and outside NIH and even the international
science effort at large.

## Key facts

- **NIH application ID:** 10918139
- **Project number:** 5U19NS104653-08
- **Recipient organization:** HARVARD UNIVERSITY
- **Principal Investigator:** JOSHUA T VOGELSTEIN
- **Activity code:** U19
- **Funding institute:** NIH
- **Fiscal year:** 2024
- **Award amount:** $203,065
- **Award type:** 5
- **Project period:** 2017-09-25 → 2027-08-31

## Primary source

NIH RePORTER: https://reporter.nih.gov/project-details/10918139

## Citation

> US National Institutes of Health, RePORTER application 10918139, DATA CORE (5U19NS104653-08). Retrieved via AI Analytics 2026-05-26 from https://api.ai-analytics.org/grant/nih/10918139. Licensed CC0.

---

*[NIH grants dataset](/datasets/nih-grants) · CC0 1.0*