DATA CORE

NIH RePORTER · NIH · U19 · $203,065

Abstract

Achieving the scientific goals of the Overall Research Strategy requires significant effort and advances in data science for neuroscience. In particular, scientific progress depends on novel experimental design, data collection, and processing (as described in Projects 1, 2, and 3), and on novel analyses and models (as described in Projects 1, 2, and 3), which lead to general principles to be tested (as described in Projects 2 and 3). The fundamental goal of the Data Science Core is to accelerate the process connecting the raw data collected in all three Projects to the analyses used to obtain data derivatives, which can then be used to build models across all three Projects and validated via electron microscopy (EM) in Project 3. The two main challenges we face in accelerating these links are big data and reproducibility. First, the data collected are too large to fit into memory, or even on disk, with each experiment on the order of one terabyte (TB) and the entire dataset amounting to hundreds of TB or more. Therefore, the classic paradigm of running all analyses in MATLAB on locally stored data is not sufficient. The solution to this is twofold: 1) build scalable algorithms, so that different individuals can apply them to these big data, and 2) develop cloud data management systems, so that all consortium members can quickly access and analyze the data and then integrate them with one another. The cloud data management system will be built on the infrastructure developed for the Open Connectome Project [1], originally developed to host data on institutional resources, and ZBrain 2.0, a resource we are developing to define a common coordinate space for zebrafish brain atlasing. Second, this is a team effort, so sharing analyses and derivatives and keeping track of metadata will be important.
The solution to this is threefold: 1) build a comprehensive scientific environment in the cloud that enables sharing of entire “digital experiments”, linking to the data and ensuring that the entire analysis pipeline can be trivially run and extended by anyone, anywhere, 2) carefully curate data and metadata in existing resources, and 3) facilitate the integration of different imaging datasets to improve ZBrain 2.0. Our entire system is, and will continue to be, open source, portable, and reproducible, and will use and extend best practices of data science and FAIR (Findable, Accessible, Interoperable, and Re-usable) [2] data management. Completing all the aims in this Data Science Core will not only enable and accelerate the scientific progress addressed by this proposal; it will also establish new standards in data science that can be immediately applied to all other U19 efforts, as well as to many other efforts within and outside NIH and even the international science effort at large.

Key facts

NIH application ID: 10918139
Project number: 5U19NS104653-08
Recipient: HARVARD UNIVERSITY
Principal Investigator: JOSHUA T VOGELSTEIN
Activity code: U19
Funding institute: NIH
Fiscal year: 2024
Award amount: $203,065
Award type: 5
Project period: 2017-09-25 → 2027-08-31