# Characterizing phenotype-associated subpopulations from single-cell sequencing data

> **NIH R01** · OREGON HEALTH & SCIENCE UNIVERSITY · 2024 · $308,000

## Abstract

Single-cell sequencing (scSeq) allows us to achieve new discoveries by distinguishing cell types, states, and
lineages from heterogeneous tissue microenvironments. However, it remains challenging to interpret complex
single-cell data from highly heterogeneous populations of cells. Currently, most existing single-cell data analyses
focus on cell type clusters defined by unsupervised clustering methods that cannot directly link cell clusters with
specific biological and clinical phenotypes. Additionally, the ever-increasing capacity of scSeq to profile
thousands to millions of cells makes it harder to pinpoint which cell clusters merit further analysis. Given
so many cells, the rationale for our "phenotype-centric" analysis draws on a Buddhist saying: "Each
individual can only drink one bottle of water from the entire river." Therefore, focusing on specific cell
subpopulations related to essential phenotypes is more important than analyzing all cell clusters equally.
Furthermore, clinical phenotype information, such as treatment resistance, survival outcomes, cancer metastasis,
and disease stages, is primarily collected on bulk tissue samples. As a result, there is an unmet need to leverage
widely available clinical phenotype information to aid subpopulation identification from single-cell data.
Meanwhile, single-cell samples generated under different conditions require tools to identify the
phenotype-enriched subpopulations for each condition. Taken together, there is a great need for further
methodological progress in "phenotype-centric" scSeq data analysis. To this end, we propose to develop a suite
of novel supervised-learning-based methods to accurately identify the most highly phenotype-associated cell subpopulations
from scSeq data. We will (1) develop a platform with broad utility for bulk phenotype-guided subpopulation
identification from scSeq data; (2) build a novel strategy to learn high-confidence phenotype-enriched subpopulations
from scSeq data; and (3) establish a new platform for supervised phenotypic trajectory learning of subpopulations
from scSeq data. The proposed methods will be evaluated by rigorous simulations and real data analyses. This
proposal is conceptually innovative in the following aspects: (1) Our bulk phenotype-guided scSeq analysis
enables hypothesis-free identification of clinically and biologically relevant cell subpopulations from scSeq data;
(2) Our supervised learning frameworks can simultaneously select genes and identify phenotype-associated
subpopulations from scSeq data; (3) Our method for learning cell subpopulations associated with continuous
phenotypes is uniquely able to recover hidden phenotypic stages. In summary, we expect this proposal
to deliver a suite of novel machine learning methods for "phenotype-centric" single-cell data analysis, thus
allowing us to precisely pinpoint disease-relevant subpopulations from single-cell data for cellular target
discovery.
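
The bulk phenotype-guided idea in aim (1) can be illustrated with a toy sketch. This is not the project's method, which the abstract does not specify; it is a simple correlation-based stand-in on simulated data. The function names `zscore_rows` and `phenotype_scores`, the two simulated expression programs, and all sample sizes and noise levels are assumptions made for illustration. Each cell is scored by how much more it correlates with phenotype-positive bulk samples than with phenotype-negative ones.

```python
import numpy as np

def zscore_rows(x):
    """Center and scale each row so that row dot products / n_genes give Pearson r."""
    x = np.asarray(x, dtype=float)
    centered = x - x.mean(axis=1, keepdims=True)
    sd = centered.std(axis=1, keepdims=True)
    return centered / np.where(sd == 0, 1.0, sd)  # guard against constant rows

def phenotype_scores(cells, bulk, phenotype):
    """Hypothetical score: mean correlation with phenotype-positive bulk samples
    minus mean correlation with phenotype-negative bulk samples."""
    n_genes = cells.shape[1]
    corr = zscore_rows(cells) @ zscore_rows(bulk).T / n_genes  # cells x bulk samples
    pos = np.asarray(phenotype, dtype=bool)
    return corr[:, pos].mean(axis=1) - corr[:, ~pos].mean(axis=1)

# Simulated data (an assumption, not real scSeq): two expression programs.
rng = np.random.default_rng(0)
n_genes = 20
program_a = np.ones(n_genes)
program_a[:10] = 5.0   # genes 0-9 elevated in subpopulation A
program_b = np.ones(n_genes)
program_b[10:] = 5.0   # genes 10-19 elevated in subpopulation B

# 30 cells from each subpopulation, with per-gene noise.
cells = np.vstack([
    program_a + rng.normal(0, 0.5, size=(30, n_genes)),
    program_b + rng.normal(0, 0.5, size=(30, n_genes)),
])

# Bulk samples are mixtures: phenotype-positive samples are dominated by A.
bulk = np.vstack([
    0.8 * program_a + 0.2 * program_b + rng.normal(0, 0.3, size=(5, n_genes)),
    0.2 * program_a + 0.8 * program_b + rng.normal(0, 0.3, size=(5, n_genes)),
])
phenotype = np.array([1] * 5 + [0] * 5)

scores = phenotype_scores(cells, bulk, phenotype)
top_cells = np.argsort(scores)[::-1][:30]  # indices of the 30 highest-scoring cells
```

In this simulation the highest-scoring cells are the ones drawn from subpopulation A, the program that dominates the phenotype-positive bulk samples. A supervised model with gene selection, as the abstract proposes, would replace the plain correlation difference used here.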

## Key facts

- **NIH application ID:** 10838501
- **Project number:** 5R01GM147365-02
- **Recipient organization:** OREGON HEALTH & SCIENCE UNIVERSITY
- **Principal Investigator:** Zheng Xia
- **Activity code:** R01
- **Funding institute:** NIH
- **Fiscal year:** 2024
- **Award amount:** $308,000
- **Award type:** 5
- **Project period:** 2023-06-01 → 2027-05-31

## Primary source

NIH RePORTER: https://reporter.nih.gov/project-details/10838501

## Citation

> US National Institutes of Health, RePORTER application 10838501, Characterizing phenotype-associated subpopulations from single-cell sequencing data (5R01GM147365-02). Retrieved via AI Analytics 2026-05-24 from https://api.ai-analytics.org/grant/nih/10838501. Licensed CC0.

---

*[NIH grants dataset](/datasets/nih-grants) · CC0 1.0*