Subtyping complex phenotypes via contrastive learning by leveraging electronic health records

NIH RePORTER · NIH · R21 · $428,448

Abstract

A critical step towards realizing the promise of precision medicine is the identification of biologically and clinically relevant disease subtypes. Disease subtypes are suspected yet unknown or not fully characterized for many conditions, including obesity, diabetes, hypertension, asthma, dementia, and psychiatric disorders. The existence of “phenotypic heterogeneity” has practical and clinical implications: undifferentiated cases of a disease may represent the action of a variety of underlying causal processes, each of which may have a different prognosis or respond to a different treatment. Existing phenotype subtyping methods predominantly rely on the idea that applying clustering or dimensionality reduction techniques to high-dimensional data from patients with a given condition may reveal explanatory patterns that correspond to disease subtypes. This implicitly assumes that biologically meaningful subtypes can be captured by the dominant axes of variation in the data. Yet the dominant sources of variation are expected to be independent of biologically meaningful subtypes in many settings. In this project, a novel contrastive learning method is proposed for learning a heterogeneity gradient of variation that is specific to cases of a given condition and cannot be found in matched controls. Electronic health records (EHR) and survey information from the rich All of Us database are expected to span the spectrum of clinical heterogeneity across common complex diseases, which can inform the proposed method about meaningful sub-phenotypic variation for many diseases. The subtypes identified will be evaluated within the All of Us database and replicated using three external EHR cohorts for subtype-specific genetic effects, clinical risk factors, and clinical trajectories.
Finally, EHR-based models are notoriously susceptible to poor generalization on out-of-distribution data that represent locations, populations, medical practices, or other factors that were not represented in the training data. This challenge will be addressed by developing a domain generalization framework for learning disease subtypes that generalize across demographic characteristics, including location, ancestry, ethnicity, and race. Such generalization is essential to achieving equitable precision medicine and to facilitating the integration of predictive models into healthcare pipelines.
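
The idea of finding variation that is present in cases but absent in matched controls can be illustrated with a small sketch. The code below uses contrastive PCA, a well-known existing technique swapped in here purely for illustration; it is not the method proposed in this project, whose details the abstract does not give. The function name `contrastive_directions`, the weight `alpha`, and the synthetic data (one shared nuisance feature, one case-only feature with two hypothetical subtypes) are all assumptions made for the example.

```python
import numpy as np

def contrastive_directions(cases, controls, alpha=1.0, k=1):
    """Top-k directions with high variance in cases relative to controls.

    Illustrative contrastive PCA; alpha=0 reduces to ordinary PCA on cases.
    """
    # Center each group on its own mean.
    cases_c = cases - cases.mean(axis=0)
    controls_c = controls - controls.mean(axis=0)
    cov_cases = cases_c.T @ cases_c / (len(cases) - 1)
    cov_controls = controls_c.T @ controls_c / (len(controls) - 1)
    # The contrast matrix is symmetric, so eigh is appropriate.
    contrast = cov_cases - alpha * cov_controls
    eigvals, eigvecs = np.linalg.eigh(contrast)
    order = np.argsort(eigvals)[::-1][:k]  # largest eigenvalues first
    return eigvecs[:, order]

# Synthetic data (hypothetical, for illustration only).
rng = np.random.default_rng(0)
n, d = 2000, 5
controls = rng.normal(size=(n, d))
cases = rng.normal(size=(n, d))
# Feature 0: large variance shared by cases and controls (a nuisance factor).
controls[:, 0] *= 5.0
cases[:, 0] *= 5.0
# Feature 1: heterogeneity found only in cases (two hypothetical subtypes).
subtype = rng.integers(0, 2, size=n)
cases[:, 1] += 4.0 * (2 * subtype - 1)

pca_dir = contrastive_directions(cases, controls, alpha=0.0)[:, 0]
cpca_dir = contrastive_directions(cases, controls, alpha=1.0)[:, 0]
print("PCA top direction:        ", np.round(np.abs(pca_dir), 2))
print("Contrastive top direction:", np.round(np.abs(cpca_dir), 2))
```

In this toy setting, ordinary PCA on the cases is dominated by the shared nuisance feature, while subtracting the control covariance shifts the leading direction toward the case-only feature. This mirrors the abstract's point that the dominant axes of variation need not align with meaningful subtypes.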

Key facts

NIH application ID: 10799083
Project number: 1R21HG013393-01
Recipient: UNIVERSITY OF CALIFORNIA LOS ANGELES
Principal Investigator: Elior Rahmani
Activity code: R21
Funding institute: NIH
Fiscal year: 2023
Award amount: $428,448
Award type: 1
Project period: 2023-09-22 → 2025-08-31