Developing tools for the unbiased analysis and visualization of scRNA-seq data

NIH RePORTER · NIH · R01 · $296,743

Abstract

Single-cell RNA sequencing (scRNA-seq) provides genome-wide information about gene expression at the resolution of individual cells. The unprecedented scope of these data is revolutionizing our understanding of development and tissue homeostasis as well as diseases like cancer. A major issue with scRNA-seq, however, is the sheer scale of the data, consisting of ~20,000 gene expression measurements in thousands to millions of cells. Effective computational approaches are clearly required to translate data of this size and complexity into actionable biological insights.

For instance, scRNA-seq data are approximately 20,000-dimensional, and as a result all available analysis pipelines rely on multiple dimensionality reduction steps. This usually entails a combination of linear tools like PCA and non-linear techniques like t-SNE and UMAP. The data are generally reduced to between 10- and 100-D for data analysis (e.g. clustering into distinct cell types) and 2-D for visualization. The problem, however, is that dimensionality reduction can lead to loss of information. We recently showed that this loss of information is dramatic: for any given cell, over 95% of its neighbors are changed in the process of dimensionality reduction. This complete change in the structure of the data can introduce significant noise and bias into the analysis, and suggests the critical need for alternative approaches.

The premise of this application is that reducing bias in scRNA-seq data analysis will maximize our ability to extract meaningful information from the data. In this proposal, we focus on developing new algorithms to address three specific steps in the typical analysis pipeline:

(1) Dimensionality Reduction: Our hypothesis is that deep neural networks can be explicitly trained to maximize the amount of information that can be retained for both data analysis and visualization.

(2) Feature Selection: Not all genes are equally informative for downstream analyses, so researchers generally choose a subset of genes based on variation in the population. We have shown that standard approaches to selecting genes convolve true biological variation with technical noise from the experiment. We hypothesize that statistical models based on our understanding of sources of technical noise can be used to select more informative genes.

(3) Cell Clustering: Clustering the data to determine cell types is critical, but cells with different identities often form complex, overlapping geometries in gene expression space that are difficult for existing algorithms to resolve. Our hypothesis is that new clustering tools, guided by prior knowledge and leveraging innovations in clustering from image segmentation, can overcome this problem.

We will build these new tools and test them against existing benchmark datasets and novel data generated by our experimental collaborators. We will also integrate these tools into popular scRNA-seq analysis packages. Successful completion of the pro...
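
The abstract's claim that most of a cell's neighbors change under dimensionality reduction can be made concrete with a small neighbor-preservation check. The sketch below is illustrative only and is not the investigators' method: it assumes Euclidean k-nearest neighbors, uses the mean Jaccard distance between neighbor sets as the score, runs on synthetic Gaussian noise rather than scRNA-seq data, and substitutes a trivial "keep the first two coordinates" projection for PCA, t-SNE, or UMAP. The function names `knn_sets` and `mean_jaccard_distance` are hypothetical.

```python
import random


def knn_sets(points, k):
    """For each point, return the index set of its k nearest neighbors (Euclidean)."""
    neighbors = []
    for i, p in enumerate(points):
        dists = []
        for j, q in enumerate(points):
            if i == j:
                continue  # a point is not its own neighbor
            d2 = sum((a - b) ** 2 for a, b in zip(p, q))
            dists.append((d2, j))
        dists.sort()  # ties are broken by index, so the result is deterministic
        neighbors.append({j for _, j in dists[:k]})
    return neighbors


def mean_jaccard_distance(high_dim, low_dim, k):
    """Average Jaccard distance between each point's k-NN set before and after reduction.

    0.0 means every neighborhood is preserved; 1.0 means no neighbor is shared.
    """
    before = knn_sets(high_dim, k)
    after = knn_sets(low_dim, k)
    total = 0.0
    for a, b in zip(before, after):
        total += 1.0 - len(a & b) / len(a | b)
    return total / len(before)


# Synthetic stand-in data (assumption): 60 "cells" x 50 "genes" of Gaussian noise.
rng = random.Random(0)
cells = [[rng.gauss(0.0, 1.0) for _ in range(50)] for _ in range(60)]

# Stand-in reduction (assumption): keep only the first two coordinates.
embedding = [row[:2] for row in cells]

identity_score = mean_jaccard_distance(cells, cells, k=10)
reduced_score = mean_jaccard_distance(cells, embedding, k=10)
print("unchanged data:", identity_score)
print("2-D projection:", reduced_score)
```

Comparing the data with itself gives a score of exactly 0, which is a useful sanity check. A real evaluation would replace `embedding` with the output of an actual reduction method applied to a normalized expression matrix, and would likely use a library nearest-neighbor search instead of the brute-force loop shown here.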

Key facts

NIH application ID
10465261
Project number
5R01GM143378-02
Recipient
UNIVERSITY OF CALIFORNIA LOS ANGELES
Principal Investigator
Eric J Deeds
Activity code
R01
Funding institute
NIH
Fiscal year
2022
Award amount
$296,743
Award type
5
Project period
2021-09-01 → 2025-08-31