Data Science Core

NIH RePORTER · NIH · U19 · $452,908

Abstract

Achieving the scientific goals of the Overall Research Strategy requires significant effort and advances in data science for neuroscience. In particular, scientific progress depends on novel experimental design, data collection, and processing (as described in Projects 1 and 2) and on novel analyses and models (as described in Project 3), which lead to general principles to be tested (as described in Project 4). The fundamental goal of the Data Science Core is to accelerate the process connecting the raw data collected in Projects 1, 2, and 4 to the analyses used to obtain data derivatives, which can then be used to build models in Project 3 and extend them in Project 4. The two main challenges we face in accelerating these links are big data and reproducibility. First, the data collected are too large to fit into memory, or even on disk, with each experiment on the order of one terabyte (TB) and the entire dataset amounting to hundreds of TB or more. Therefore, the classic paradigm of running all analyses in MATLAB on locally stored data is not sufficient. The solution is twofold: (1) build a cloud data management system, so that all consortium members can quickly access and analyze the data, and (2) build scalable algorithms, so that different individuals can apply them to these big data. The cloud data management system will be built on the infrastructure developed for the Open Connectome Project [1], originally developed to host data on institutional resources. In the last year, the team has matured to become NeuroData (http://neurodata.io), porting all the infrastructure to the commercial cloud and already hosting 20+ datasets comprising 50+ TB, including all three scales of analysis proposed here (http://neurodata.io). The scalable algorithms will be based on another NeuroData project, FlashX (http://flashx.io).
FlashX is a C++ graph analytics and machine learning library designed to run analytics on arbitrarily large data using only a single machine (not a cluster) [2,3]; it recently received a DARPA SBIR award for commercialization. We will use FlashX as a backend to support all the algorithms for processing behavior and imaging data. Second, this is a team effort, so sharing analyses and derivatives and keeping track of metadata will be important. The solution is to build a comprehensive scientific environment in the cloud that enables sharing of entire “digital experiments”, linking to the data and ensuring that the entire analysis pipeline can be trivially run and extended by anyone, anywhere. This system will extend NeuroData’s “Science in the Cloud” (http://scienceinthe.cloud) [4,5], which recently received private funding to professionalize it. Our entire system is and will remain open source, portable, and reproducible, and will use and extend best practices of data science and FAIR data management. Completing all the aims in this Data Science Find...

Key facts

NIH application ID: 10241479
Project number: 5U19NS104653-05
Recipient: HARVARD UNIVERSITY
Principal Investigator: JOSHUA T VOGELSTEIN
Activity code: U19
Funding institute: NIH
Fiscal year: 2021
Award amount: $452,908
Award type: 5
Project period: 2017-09-25 → 2022-08-31