# Data-driven and science-informed methods for the discovery of biomedical mechanisms and processes

> **NIH R35** · UNIVERSITY OF COLORADO · 2024 · $249,376

## Abstract

Project Summary
 Data-driven discovery methods are a novel class of methodologies and computational approaches, revolutionizing the
modeling, prediction, and control of complex systems, while remaining scientifically explainable and interpretable. These
methods learn governing equations directly from data and have found considerable success in a wide range of applications
including turbulence, climate, robotics, and autonomy. However, the first generation of these methods has proven poorly
suited to the study of biomedical data. To realize the full potential of data-driven approaches, they must be extended
and adapted to deal with the noise, sparsity, and variability intrinsic to experiments with living organisms. My group has
extended the seminal Sparse Identification of Nonlinear Dynamics (SINDy) method to the Weak form SINDy (WSINDy).
Weak form equations are a transform of the original data that enables learning of the equations even in the presence of
substantial noise and sparsity. The approach effectively recasts scientific discovery from proposing and validating/refuting
a single scientific hypothesis to simultaneously proposing (in many cases) more than 10^180 hypotheses and using sparse
regression to prune the hypotheses that are not supported by the data. Moreover, our approach currently takes on the
order of minutes on a standard laptop.
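
As a rough illustration of the weak-form idea described above, the following minimal sketch fits a synthetic logistic-growth trajectory, du/dt = u - u^2, from noisy samples. It is not the authors' WSINDy code: the bump test functions, the four-term polynomial library, the threshold, the noise level, and the names `weak_form_system` and `stlsq` are all assumptions chosen for brevity. Each row multiplies the equation by a compactly supported test function and integrates by parts, so the noisy data are never differentiated. With K candidate terms there are 2^K possible sub-models, which is presumably how very large hypothesis counts arise; sequentially thresholded least squares then prunes unsupported terms.

```python
import numpy as np

def weak_form_system(t, u, library, n_tests=40, half_width=1.0, p=4):
    """Assemble G w = b using bump test functions phi_k = (1 - s^2)^p, s = (t - c_k) / half_width."""
    dt = t[1] - t[0]  # assumes a uniform time grid
    centers = np.linspace(t[0] + half_width, t[-1] - half_width, n_tests)
    theta = np.column_stack([f(u) for f in library])  # candidate terms evaluated on the data
    G = np.empty((n_tests, theta.shape[1]))
    b = np.empty(n_tests)
    for k, c in enumerate(centers):
        s = (t - c) / half_width
        inside = np.abs(s) < 1.0
        phi = np.where(inside, (1.0 - s**2) ** p, 0.0)
        dphi = np.where(inside, -2.0 * p * s * (1.0 - s**2) ** (p - 1) / half_width, 0.0)
        G[k] = dt * (phi @ theta)  # integral of phi_k * Theta_j(u) dt
        b[k] = -dt * (dphi @ u)    # integral of phi_k * du/dt dt, after integration by parts
    return G, b

def stlsq(G, b, threshold=0.2, n_iter=10):
    """Sequentially thresholded least squares: zero small coefficients, then refit the rest."""
    w = np.linalg.lstsq(G, b, rcond=None)[0]
    for _ in range(n_iter):
        small = np.abs(w) < threshold
        w[small] = 0.0
        if (~small).any():
            w[~small] = np.linalg.lstsq(G[:, ~small], b, rcond=None)[0]
    return w

# Synthetic data (assumption): logistic solution with u(0) = 0.1 plus Gaussian noise.
t = np.linspace(0.0, 10.0, 1001)
u_clean = 1.0 / (1.0 + 9.0 * np.exp(-t))
rng = np.random.default_rng(0)
u_noisy = u_clean + 0.01 * rng.standard_normal(t.size)

# Candidate library (assumption): 1, u, u^2, u^3.
library = [lambda u: np.ones_like(u), lambda u: u, lambda u: u**2, lambda u: u**3]
G, b = weak_form_system(t, u_noisy, library)
w = stlsq(G, b)
print(dict(zip(["1", "u", "u^2", "u^3"], np.round(w, 3))))
```

The true coefficients for this synthetic example are (0, 1, -1, 0). How closely the noisy fit recovers them depends on the noise level, the test-function width, and the threshold, all of which would need tuning for real data.
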
 While we have made substantial progress on improving the method, we have encountered a bottleneck for unlocking
the applications of our novel approach. Our collaborators in Biochemistry in the Liu lab are able to perform experiments
in which sheets of cells are induced to migrate collectively. Videos of these experiments can be 50 GB each, and their
system can quickly generate terabytes of data. While our methodology can learn governing equations for an individual
cell on the order of minutes, there can be thousands of cells in each video and the robotic experimental setup can
simultaneously perform 96 experiments. Accordingly, learning models for all cells in these experiments can take days,
and a massively parallel computational resource will remove this bottleneck from our research. In particular, modern
multi-GPU systems are ideally suited to deal with this type of parallel-processing computational task.
 We recently ported our current equation learning code to PyTorch (a GPU-accelerated machine learning software library)
and, using an older-model NVIDIA GPU, we were able to reduce our learning time for a single cell by an order of magnitude
(from minutes to seconds). Therefore, using our updated code on the requested computational server will transform our
research program by enabling near real-time analysis of experiments.
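
The per-cell fits described above are independent of one another, which is why the workload parallelizes well. The sketch below shows one way to express that structure, assuming each cell has already been reduced to a small linear system: all systems are stacked into one array and solved in a single batched call. It uses NumPy on the CPU with random synthetic systems as a stand-in, not the project's PyTorch code. The function name `batched_least_squares` and the sizes are hypothetical, and the normal equations are used only for brevity (a QR- or SVD-based solver is more robust for ill-conditioned libraries).

```python
import numpy as np

def batched_least_squares(G, b):
    """Solve min_w ||G[i] w - b[i]|| for every cell i at once via the normal equations.

    G: (n_cells, n_tests, n_terms), b: (n_cells, n_tests) -> w: (n_cells, n_terms)
    """
    Gt = np.swapaxes(G, -1, -2)
    GtG = Gt @ G             # (n_cells, n_terms, n_terms)
    Gtb = Gt @ b[..., None]  # (n_cells, n_terms, 1)
    return np.linalg.solve(GtG, Gtb)[..., 0]

# Hypothetical sizes: 1000 cells, 40 test functions, 4 library terms.
rng = np.random.default_rng(1)
n_cells, n_tests, n_terms = 1000, 40, 4
G = rng.standard_normal((n_cells, n_tests, n_terms))
w_true = rng.standard_normal((n_cells, n_terms))
b = (G @ w_true[..., None])[..., 0]  # noise-free right-hand sides

w = batched_least_squares(G, b)
print(np.max(np.abs(w - w_true)))  # worst-case recovery error across cells
```

The same stacked layout is what a GPU tensor library would distribute across devices; the thresholding step of a sparse regression would then be applied per cell with a boolean mask.
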

## Key facts

- **NIH application ID:** 11100840
- **Project number:** 3R35GM149335-02S1
- **Recipient organization:** UNIVERSITY OF COLORADO
- **Principal Investigator:** David Bortz
- **Activity code:** R35
- **Funding institute:** NIH
- **Fiscal year:** 2024
- **Award amount:** $249,376
- **Award type:** 3
- **Project period:** 2023-09-25 → 2028-08-31

## Primary source

NIH RePORTER: https://reporter.nih.gov/project-details/11100840

## Citation

> US National Institutes of Health, RePORTER application 11100840, Data-driven and science-informed methods for the discovery of biomedical mechanisms and processes (3R35GM149335-02S1). Retrieved via AI Analytics 2026-05-27 from https://api.ai-analytics.org/grant/nih/11100840. Licensed CC0.

---

*[NIH grants dataset](/datasets/nih-grants) · CC0 1.0*
