Manifold representations and active learning for 21st century biology

NIH RePORTER · NIH · R35 · $359,947

Abstract

Project Summary: With the rise of high-throughput sequencing and multiplexed biotechnologies enabling single-cell multi-omics and massively parallel CRISPR experiments, the biomedical community is generating a monumental amount of data. These data promise to reveal new biology and drive personal and precision medicine. However, the sheer volume of genomic data is overwhelming current computational resources, requiring prohibitively high compute time, memory usage, and storage. My lab has been at the forefront of solving big data challenges in genomics, designing novel algorithms that enable efficient and secure analyses that were previously computationally infeasible, and that reveal novel structural, cellular, and systems biology. Drawing upon our expertise in developing scalable and insightful algorithms for analyzing genomic, transcriptomic, and proteomic data, we aim to tackle two key data-driven challenges facing the biological community: 1) efficient, accurate, and robust characterization of tissues at the single-cell level, and 2) translating high-throughput datasets into biological discoveries via machine learning-based prediction.

To solve the first challenge, we will leverage our discovery that seemingly high-dimensional sequencing data often lie on low-dimensional manifolds that capture the underlying biological state of interest. We will design algorithms that generate these compact, meaningful manifold representations of single-cell omics datasets. This will enable a number of key applications, including characterizing co-expression and gene modules that define healthy and pathologic cell states; integrating multi-modal single-cell omics datasets to more richly characterize cellular diversity; and investigating the mechanisms underlying transcriptomic diversity across tissues and developmental states.

To solve the second challenge, we will take a two-pronged approach. First, we will design novel machine learning frameworks that provide a measure of confidence when predicting in unfamiliar biological states, enabling prediction that is robust to “out-of-distribution” (unobserved) examples. We will then work with our experimental collaborators and CROs to rapidly perform experimental validation of model-based predictions. Finally, we will return the experimental results to the model to further improve performance. This will enable an “active learning” feedback loop to efficiently explore a complex biological space for outcomes of interest. We will use this uncertainty-powered active learning approach to explore several pressing biological concerns, such as the identification of small molecule compounds with enzymatic or whole-cell growth inhibitory properties, efficient design of spatial-transcriptomic experiments, computationally guided CRISPR perturbation experiments, and identification of functional non-coding mutations. This project will result in 1) numerous software tools with wide utility that efficiently analyze massive biologic...
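The predict, validate, and retrain loop described in the abstract can be illustrated with a small sketch. Nothing below is taken from the project itself: the bootstrap ensemble of ridge regressions is only a stand-in for a model that reports confidence, the `oracle` callable is a hypothetical placeholder for experimental validation, and all function names and parameters are assumptions chosen for illustration.

```python
import numpy as np


def fit_ridge(X, y, lam=1e-2):
    """Closed-form ridge regression; returns the weight vector."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)


def ensemble_predict(X_train, y_train, X_query, n_models=20, rng=None):
    """Mean and standard deviation of predictions across bootstrap-trained models.

    The spread across models is used as a rough uncertainty estimate
    (an assumption of this sketch, not the project's method).
    """
    rng = np.random.default_rng() if rng is None else rng
    n = len(X_train)
    preds = []
    for _ in range(n_models):
        idx = rng.integers(0, n, size=n)  # bootstrap resample of the labeled set
        w = fit_ridge(X_train[idx], y_train[idx])
        preds.append(X_query @ w)
    preds = np.stack(preds)
    return preds.mean(axis=0), preds.std(axis=0)


def active_learning(X_pool, oracle, n_init=5, n_rounds=10, batch=3, seed=0):
    """Repeatedly label the pool candidates the ensemble is least sure about.

    `oracle(x)` is a hypothetical stand-in for running an experiment on x.
    Returns the labeled indices (in acquisition order) and their measured values.
    """
    rng = np.random.default_rng(seed)
    labeled = [int(i) for i in rng.choice(len(X_pool), size=n_init, replace=False)]
    y = {i: oracle(X_pool[i]) for i in labeled}
    for _ in range(n_rounds):
        unlabeled = np.array([i for i in range(len(X_pool)) if i not in y])
        if len(unlabeled) == 0:
            break
        X_tr = X_pool[labeled]
        y_tr = np.array([y[i] for i in labeled])
        _, std = ensemble_predict(X_tr, y_tr, X_pool[unlabeled], rng=rng)
        # Pick the candidates with the largest ensemble disagreement.
        picks = unlabeled[np.argsort(-std)[:batch]]
        for i in picks:
            i = int(i)
            y[i] = oracle(X_pool[i])  # "experimental validation" step
            labeled.append(i)         # result is fed back into the next round
    return labeled, y
```

A call such as `active_learning(X, lambda x: float(x @ w_true))` on a synthetic pool `X` with a hypothetical ground-truth vector `w_true` exercises the loop end to end. In a real setting the acquisition rule would typically balance uncertainty against predicted outcome, and the oracle would be a wet-lab measurement rather than a function call.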

Key facts

NIH application ID: 10401890
Project number: 5R35GM141861-02
Recipient: MASSACHUSETTS INSTITUTE OF TECHNOLOGY
Principal Investigator: BONNIE BERGER
Activity code: R35
Funding institute: NIH
Fiscal year: 2022
Award amount: $359,947
Award type: 5
Project period: 2021-06-01 → 2026-05-31