Data-driven and science-informed methods for the discovery of biomedical mechanisms and processes

NIH RePORTER · NIH · R35 · $249,376

Abstract

Project Summary

Data-driven discovery methods are a novel class of methodologies and computational approaches that are revolutionizing the modeling, prediction, and control of complex systems while remaining scientifically explainable and interpretable. These methods learn governing equations directly from data and have found considerable success in a wide range of applications, including turbulence, climate, robotics, and autonomy. However, the first generation of these methods has proven poorly suited to the study of biomedical data. To realize the full potential of data-driven approaches, they must be extended and adapted to deal with the noise, sparsity, and variability intrinsic to experiments with living organisms. My group has extended the seminal Sparse Identification of Nonlinear Dynamics (SINDy) method to Weak-form SINDy (WSINDy). Weak-form equations are a transform of the original data that enables learning of the equations even in the presence of substantial noise and sparsity. The approach effectively recasts scientific discovery from proposing and validating or refuting a single scientific hypothesis to simultaneously proposing (in many cases) more than 10^180 hypotheses and using sparse regression to prune the hypotheses that are not supported by the data. Moreover, our approach currently takes on the order of minutes on a standard laptop. While we have made substantial progress on improving the method, we have encountered a bottleneck in unlocking the applications of our novel approach. Our collaborators in Biochemistry in the Liu lab are able to perform experiments in which sheets of cells are induced to migrate collectively. Videos of these experiments can be 50 GB each, and their system can quickly generate terabytes of data. While our methodology can learn governing equations for an individual cell on the order of minutes, there can be thousands of cells in each video, and the robotic experimental setup can simultaneously perform 96 experiments.
Accordingly, learning models for all cells in these experiments can take days, and a massively parallel computational resource will remove this bottleneck from our research. In particular, modern multi-GPU systems are ideally suited to this type of parallel-processing computational task. We recently ported our current equation learning code to PyTorch (a machine learning software library with GPU support) and, using an older-model NVIDIA GPU, were able to reduce our learning time for a single cell by an order of magnitude (from minutes to seconds). Therefore, using our updated code on the requested computational server will transform our research program by enabling near real-time analysis of experiments.

Key facts

NIH application ID: 11100840
Project number: 3R35GM149335-02S1
Recipient: UNIVERSITY OF COLORADO
Principal Investigator: David Bortz
Activity code: R35
Funding institute: NIH
Fiscal year: 2024
Award amount: $249,376
Award type: 3
Project period: 2023-09-25 → 2028-08-31