# Tackling Big Data problems in biomedical sciences with extended similarity methods

> **NIH R35** · UNIVERSITY OF FLORIDA · 2024 · $341,126

## Abstract

The overall goal of our research program is to develop new multi-purpose similarity-based tools to extract
and analyze information from very large datasets in the biomedical sciences. A central aspect of our work will be
the determination of the distance (or similarity) between different objects, a fundamental notion that pervades
many aspects of modern data science. Similarity searches are at the core of high-throughput virtual screening,
an essential task in medicinal chemistry and drug design. Comparisons also play a key role in rationalizing the
results of Molecular Dynamics (MD) simulations by helping us to identify the most important conformations of a
system, and how they contribute to its dynamic behavior. Similarity-based techniques are also essential in
spectral studies, being the foundation behind the post-processing machinery in Imaging Mass Spectrometry
(IMS). However, these applications are currently based on metrics that can only compare two objects at a time,
so comparing N objects scales quadratically, which makes them fundamentally ill-equipped to handle the amount
of data generated by state-of-the-art simulations and experiments. We recently generalized the pair-wise
comparisons, proposing extended similarity indices that allow us to compare an arbitrary number of objects
simultaneously. Our indices offer unprecedented efficiency, while also outperforming their binary counterparts in
diversity picking, feature selection, and clustering. We will leverage these advantages in three main research
directions. (1) We will develop protocols to improve the drug design process via careful exploration of the
chemical space. The extended indices will allow us to study the relations among various very large molecular
libraries, which will be key in polypharmacology and drug repurposing. They will also lead to better measures of
chemical diversity and a deeper understanding of structure-activity relations. This will serve as a guide in
generative molecular models, resulting in more robust identification of new drug leads. (2) We will present new
workflows to efficiently analyze biological ensembles. Our medoid algorithm will identify conformations close to
the folded state of a protein, while our clustering will classify the structures corresponding to other metastable
states. Alternatively, we will implement sampling techniques that will allow us to analyze very long MD
simulations. These tools can then be combined to gain a deeper understanding of various dynamical processes,
including the detailed exploration of protein folding landscapes. (3) We will develop new post-processing
techniques to aid with the interpretation of IMS data. Our similarity indices can be used to identify spatially- and
molecularly-correlated domains in tissues, without the unphysical artifacts present in other techniques. This will
allow us to track the spatial heterogeneity of metabolic processes, which is critical to the validation of I...
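
The scaling argument in the abstract (pairwise metrics cost quadratic time for N objects, while comparing all objects at once does not) can be illustrated with a small sketch. The code below is not the extended similarity indices proposed in the project, whose definitions are not given in this record. It assumes binary fingerprints and uses the simple Russell-Rao measure (shared on-bits divided by fingerprint length) only because its average over all pairs can be recovered exactly from per-bit column sums. The function names and the toy data are hypothetical.

```python
from itertools import combinations
from math import comb


def mean_pairwise_russell_rao(fps):
    """Quadratic baseline: average shared on-bit fraction over every pair."""
    n_bits = len(fps[0])
    pairs = list(combinations(fps, 2))
    shared = sum(sum(a & b for a, b in zip(x, y)) for x, y in pairs)
    return shared / (len(pairs) * n_bits)


def mean_russell_rao_from_column_sums(fps):
    """Single pass over the data: same average from per-bit column sums.

    A bit that is on in k fingerprints is shared by comb(k, 2) pairs,
    so no explicit loop over pairs is needed.
    """
    n, n_bits = len(fps), len(fps[0])
    col_sums = [sum(col) for col in zip(*fps)]
    shared = sum(comb(k, 2) for k in col_sums)
    return shared / (comb(n, 2) * n_bits)


# Toy binary fingerprints (illustrative data, not from the project).
fingerprints = [
    [1, 0, 1, 1, 0, 0],
    [1, 1, 1, 0, 0, 0],
    [0, 0, 1, 1, 0, 1],
    [1, 0, 1, 0, 1, 0],
]

print(mean_pairwise_russell_rao(fingerprints))
print(mean_russell_rao_from_column_sums(fingerprints))
```

Both functions return the same value. The first visits every pair of fingerprints, while the second reads each fingerprint once, which is the kind of saving the abstract attributes to comparing many objects simultaneously.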

## Key facts

- **NIH application ID:** 10931404
- **Project number:** 5R35GM150620-02
- **Recipient organization:** UNIVERSITY OF FLORIDA
- **Principal Investigator:** Ramon Alain Miranda Quintana
- **Activity code:** R35
- **Funding institute:** NIH
- **Fiscal year:** 2024
- **Award amount:** $341,126
- **Award type:** 5 (non-competing continuation)
- **Project period:** 2023-09-21 → 2028-07-31

## Primary source

NIH RePORTER: https://reporter.nih.gov/project-details/10931404

## Citation

> US National Institutes of Health, RePORTER application 10931404, Tackling Big Data problems in biomedical sciences with extended similarity methods (5R35GM150620-02). Retrieved via AI Analytics 2026-05-26 from https://api.ai-analytics.org/grant/nih/10931404. Licensed CC0.

---

*[NIH grants dataset](/datasets/nih-grants) · CC0 1.0*