# Subtyping complex phenotypes via contrastive learning by leveraging electronic health records

> **NIH R21** · UNIVERSITY OF CALIFORNIA LOS ANGELES · 2023 · $428,448

## Abstract

A critical step towards realizing the promise of precision medicine is the identification of biologically and
clinically relevant disease subtypes. Disease subtypes are suspected yet unknown or not fully characterized
for many conditions, including obesity, diabetes, hypertension, asthma, dementia, and psychiatric disorders.
The existence of “phenotypic heterogeneity” has practical and clinical implications: undifferentiated cases of a
disease may represent the action of a variety of underlying causal processes, each of which may have a
different prognosis or respond to a different treatment. Existing phenotype subtyping methods predominantly
rely on the idea that applying clustering or dimensionality reduction techniques to high-dimensional data from
patients with a given condition may reveal explanatory patterns that correspond to disease subtypes. This
implicitly assumes that biologically meaningful subtypes can be captured by the dominant axes of variation in
the data. Yet in many settings, the dominant sources of variation are expected to be independent of
biologically meaningful subtypes. In this project, a novel contrastive learning method is proposed for
learning a heterogeneity gradient of variation that is specific to cases of a given condition and cannot be found
in matched controls. Electronic health records (EHR) and survey information from the rich All of Us database are
expected to span the spectrum of clinical heterogeneity across common complex diseases, which can inform
the proposed method about meaningful sub-phenotypic variation for many diseases. The subtypes identified
will be evaluated within the All of Us database and replicated using three external EHR cohorts for
subtype-specific genetic effects, clinical risk factors, and clinical trajectories. Finally, EHR-based models are
notorious for generalizing poorly to out-of-distribution data representing locations, populations, medical
practices, or other factors that were not represented in the training data. This challenge
will be addressed by developing a domain generalization framework for learning disease
subtypes that generalize across demographic characteristics, including location, ancestry, ethnicity, and
race. Such generalization is essential for achieving equitable precision medicine and for integrating predictive
models into healthcare pipelines.
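
The abstract does not specify the proposed method, so the following is only an illustrative sketch of the general idea of finding variation that is present in cases but not in matched controls. It swaps in contrastive PCA, a well-known linear technique that takes the top eigenvectors of the case covariance minus a scaled control covariance. The function name `contrastive_directions`, the weight `alpha`, and the synthetic data are all assumptions made for illustration and do not come from the grant.

```python
import numpy as np


def contrastive_directions(cases, controls, alpha=1.0, n_components=1):
    """Illustrative contrastive PCA (not the grant's method).

    Returns the top eigenvectors of cov(cases) - alpha * cov(controls).
    With alpha=0 this reduces to ordinary PCA on the cases.
    """
    cases_centered = cases - cases.mean(axis=0)
    controls_centered = controls - controls.mean(axis=0)
    cov_cases = cases_centered.T @ cases_centered / (len(cases) - 1)
    cov_controls = controls_centered.T @ controls_centered / (len(controls) - 1)
    contrast = cov_cases - alpha * cov_controls
    # eigh handles symmetric matrices and returns eigenvalues in ascending order.
    eigvals, eigvecs = np.linalg.eigh(contrast)
    order = np.argsort(eigvals)[::-1][:n_components]
    return eigvecs[:, order]


# Synthetic example (hypothetical data): feature 0 carries large nuisance
# variation shared by cases and controls; feature 3 carries a hidden
# two-level subtype signal that exists only in cases.
rng = np.random.default_rng(0)
n, d = 2000, 5
controls = rng.normal(size=(n, d))
controls[:, 0] *= 3.0
cases = rng.normal(size=(n, d))
cases[:, 0] *= 3.0
subtype = rng.integers(0, 2, size=n)
cases[:, 3] += 4.0 * subtype

# Index of the feature with the largest loading on each top direction.
pca_top = int(np.argmax(np.abs(contrastive_directions(cases, controls, alpha=0.0)[:, 0])))
cpca_top = int(np.argmax(np.abs(contrastive_directions(cases, controls, alpha=1.0)[:, 0])))
print("ordinary PCA top feature:", pca_top)
print("contrastive top feature:", cpca_top)
```

In this toy setting, ordinary PCA on the cases is expected to follow the shared nuisance feature, while the contrastive direction is expected to follow the case-specific subtype feature. This mirrors the abstract's point that dominant axes of variation need not correspond to meaningful subtypes.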

## Key facts

- **NIH application ID:** 10799083
- **Project number:** 1R21HG013393-01
- **Recipient organization:** UNIVERSITY OF CALIFORNIA LOS ANGELES
- **Principal Investigator:** Elior Rahmani
- **Activity code:** R21
- **Funding institute:** NIH
- **Fiscal year:** 2023
- **Award amount:** $428,448
- **Award type:** 1
- **Project period:** 2023-09-22 → 2025-08-31

## Primary source

NIH RePORTER: https://reporter.nih.gov/project-details/10799083

## Citation

> US National Institutes of Health, RePORTER application 10799083, Subtyping complex phenotypes via contrastive learning by leveraging electronic health records (1R21HG013393-01). Retrieved via AI Analytics 2026-05-24 from https://api.ai-analytics.org/grant/nih/10799083. Licensed CC0.

---

*[NIH grants dataset](/datasets/nih-grants) · CC0 1.0*
