Characterizing phenotype-associated subpopulations from single-cell sequencing data

NIH RePORTER · NIH · R01 · $308,000

Abstract

Project Summary

Single-cell sequencing (scSeq) enables new discoveries by distinguishing cell types, states, and lineages within heterogeneous tissue microenvironments. However, it remains challenging to interpret complex single-cell data from highly heterogeneous populations of cells. Most existing single-cell data analyses focus on cell type clusters defined by unsupervised clustering methods, which cannot directly link cell clusters with specific biological and clinical phenotypes. Additionally, the ever-increasing capability of scSeq to profile thousands to millions of cells makes it harder to pinpoint which cell clusters merit further analysis. With so many cells available, the rationale of our "phenotype-centric" analysis is based on a Buddhist theory that "Each individual can only drink one bottle of water from the entire river." Therefore, focusing on specific cell subpopulations related to essential phenotypes is more important than analyzing all cell clusters evenly. Furthermore, clinical phenotype information, such as treatment resistance, survival outcomes, cancer metastasis, and disease stage, is primarily collected on bulk tissue samples. As a result, there is an unmet need to leverage widely available clinical phenotype information to aid subpopulation identification from single-cell data. Meanwhile, single-cell samples generated under different conditions require tools to identify the phenotype-enriched subpopulations for each condition. Taken together, there is a great need for further methodological progress in "phenotype-centric" scSeq data analysis. To this end, we propose to develop a suite of novel supervised-learning-based methods to accurately identify the most highly phenotype-associated cell subpopulations from scSeq data.
We will (1) develop a platform with broad utility for bulk phenotype-guided subpopulation identification from scSeq data; (2) build a novel strategy to learn high-confidence phenotype-enriched subpopulations from scSeq data; and (3) establish a new platform for supervised phenotypic trajectory learning of subpopulations from scSeq data. The proposed methods will be evaluated through rigorous simulations and real data analyses. This proposal is conceptually innovative in the following respects: (1) our bulk phenotype-guided scSeq analysis enables hypothesis-free identification of clinically and biologically relevant cell subpopulations from scSeq data; (2) our supervised learning frameworks can simultaneously select genes and identify phenotype-associated subpopulations from scSeq data; and (3) our method for learning cell subpopulations associated with continuous phenotypes has the unique ability to recover hidden phenotypic stages. In summary, we expect this proposal to deliver a suite of novel machine learning methods for "phenotype-centric" single-cell data analysis, allowing us to precisely pinpoint disease-relevant subpopulations from single-cell data for cellular target discovery.

Key facts

NIH application ID: 10838501
Project number: 5R01GM147365-02
Recipient: OREGON HEALTH & SCIENCE UNIVERSITY
Principal Investigator: Zheng Xia
Activity code: R01
Funding institute: NIH
Fiscal year: 2024
Award amount: $308,000
Award type: 5
Project period: 2023-06-01 → 2027-05-31