# EFFICIENT METHODS FOR CALIBRATION, CLUSTERING, VISUALIZATION AND IMPUTATION OF LARGE scRNA-seq DATA

> **NIH R01** · YALE UNIVERSITY · 2021 · $400,846

## Abstract

Single-cell RNA-seq (scRNA-seq) profiling provides an unprecedented opportunity to conduct detailed
analysis of cell subpopulations. Fulfilling the promise of scRNA-seq for biomedical studies and biomarker
discovery requires robust computational approaches to support detection of rare phenotypes and unanticipated
cellular responses. Current approaches for imputation, calibration, clustering and visualization of scRNA-seq
data suffer from challenges such as erroneous imputation of non-expressed genes, the limitations of linear
assumptions in removing multivariate batch effects, and the inefficiency of clustering and dimensionality reduction
methods on very large datasets. We have developed spectral, neural network, and Fast Multipole Method
(FMM) prototypes suitable for addressing these issues in scRNA-seq and other high-throughput
data contexts, and we propose to further develop and adapt these methods to scRNA-seq data analysis. Our team
of experts on data analytics and computational biology is currently funded through the NIH BD2K initiative to
develop novel big data tools and methods that have broad applicability to biomedical science. This effort
proved the feasibility of extremely efficient scalable prototypes of neural network, spectral, and harmonic
analysis techniques suitable for calibrating, reducing the dimensionality of, and visualizing high-dimensional data,
finding intrinsic state-probability densities, and co-organizing cells, markers and samples. We propose
substantial advances over existing analytical procedures used in single-cell RNA-seq studies, including matrix
recovery approaches for the sparse and noisy scRNA-seq data by combining matrix completion and statistical
techniques (Aim 1A), and calibration based on our unsupervised MMD-ResNet neural network prototype and
optimal transport theory (Aim 1B). We will develop a variant of the FMM approach to speed up the calculation
of the repulsion term of the t-distributed stochastic neighbor embedding (t-SNE) visualization technique, which
will improve our current fastest t-SNE FFT-based FIt-SNE prototype, and develop new reliable approximate
nearest neighbors approaches to speed up the computation of the attraction term of t-SNE and other clustering
algorithms (Aim 2A). Our additional variants of t-SNE will be further developed to allow better separation
between clusters of cell subpopulations (late exaggeration) and better visualization using 1D t-SNE for
heatmap gene-cell representation (Aim 2A). We will adapt SpectralNet, our efficient neural network approach,
for computing graph Laplacian eigenvectors for large datasets. This will enable computation of spectral
clustering, diffusion maps and manifold learning that are utilized in many scRNA-seq studies but are currently
limited to a moderate number of single cells (Aim 2B). Finally, we will develop a kernel-based differential
abundance algorithm to characterize differences between biological conditions (Aim 2C). We will adopt
...
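
The abstract distinguishes the attraction and repulsion terms of the t-SNE gradient. As a rough illustration only, the sketch below computes both terms exactly, in quadratic time, for a tiny embedding. It is not the FIt-SNE or FMM approach described in the abstract, and the function names `tsne_gradient_terms` and `tsne_gradient` are hypothetical. It assumes the standard t-SNE gradient with a Student-t kernel and a symmetric input affinity matrix `P` whose entries sum to 1.

```python
def tsne_gradient_terms(Y, P):
    """Return (attraction, repulsion) terms of the t-SNE gradient per point.

    Y: list of low-dimensional points, each a list of floats.
    P: symmetric matrix of input affinities p_ij (zero diagonal, sums to 1).
    Illustrative exact O(N^2) computation; names are hypothetical.
    """
    n, dim = len(Y), len(Y[0])

    # Unnormalised Student-t kernel: w_ij = 1 / (1 + ||y_i - y_j||^2).
    W = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i != j:
                d2 = sum((a - b) ** 2 for a, b in zip(Y[i], Y[j]))
                W[i][j] = 1.0 / (1.0 + d2)

    # Normaliser, so that the embedding affinities are q_ij = w_ij / Z.
    Z = sum(sum(row) for row in W)

    attraction = [[0.0] * dim for _ in range(n)]
    repulsion = [[0.0] * dim for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i == j:
                continue
            for k in range(dim):
                diff = Y[i][k] - Y[j][k]
                # Attraction: sum_j p_ij * w_ij * (y_i - y_j).
                attraction[i][k] += P[i][j] * W[i][j] * diff
                # Repulsion: sum_j q_ij * w_ij * (y_i - y_j), with q_ij = w_ij / Z.
                repulsion[i][k] += (W[i][j] ** 2 / Z) * diff
    return attraction, repulsion


def tsne_gradient(Y, P):
    """Full gradient: 4 * (attraction - repulsion) for each point."""
    attraction, repulsion = tsne_gradient_terms(Y, P)
    return [
        [4.0 * (a - r) for a, r in zip(att_i, rep_i)]
        for att_i, rep_i in zip(attraction, repulsion)
    ]
```

The repulsion term sums over all pairs of points regardless of `P`, which is why it is the natural target for fast approximation. The attraction term only involves pairs with nonzero `p_ij`, so in practice it depends on a nearest-neighbor search.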

## Key facts

- **NIH application ID:** 10126872
- **Project number:** 5R01GM131642-03
- **Recipient organization:** YALE UNIVERSITY
- **Principal Investigator:** Yuval Kluger
- **Activity code:** R01
- **Funding institute:** NIH
- **Fiscal year:** 2021
- **Award amount:** $400,846
- **Award type:** 5
- **Project period:** 2019-05-01 → 2023-01-31

## Primary source

NIH RePORTER: https://reporter.nih.gov/project-details/10126872

## Citation

> US National Institutes of Health, RePORTER application 10126872, EFFICIENT METHODS FOR CALIBRATION, CLUSTERING, VISUALIZATION AND IMPUTATION OF LARGE scRNA-seq DATA (5R01GM131642-03). Retrieved via AI Analytics 2026-05-22 from https://api.ai-analytics.org/grant/nih/10126872. Licensed CC0.

---

*[NIH grants dataset](/datasets/nih-grants) · CC0 1.0*
