# Probabilistic modeling of observational clinical data for high-throughput inference of disease phenotypes

> **NIH F31** · COLUMBIA UNIVERSITY HEALTH SCIENCES · 2021 · $51,036

## Abstract

 Today's healthcare infrastructure supports the production and storage of clinical data on a massive scale. A
central goal in clinical informatics is to leverage these data to improve our understanding of health and disease.
However, a major challenge is the paucity of reliable disease labels in observational data. Disease phenotypes
address this issue by summarizing the characteristics of specific diseases in terms of commonly observed clinical
variables. Classically, disease phenotypes are engineered via a manual expert-driven approach which fails to
scale to large numbers of diseases. Data-driven methods for disease phenotyping aim to obtain large numbers
of disease phenotypes by directly modeling large-scale observational clinical data. Such high-throughput
methods may scale, but generally cannot guarantee identifiability; that is, inferred phenotypes are not
guaranteed to map to specific diseases. In addition, data-driven disease phenotyping methods generally model
phenotypes independently with no effort to capture relationships among diseases which would be consistent with
our understanding of comorbidities, disease progression trends, and disease type/subtype relationships.
 The long-term goal of the proposed research is to support large-scale analysis of observational clinical data
by introducing a family of closely related models for high-throughput disease phenotyping which resolve the
issue of identifiability and model relationships among diseases. My work is inspired by an unsupervised
probabilistic graphical model for high-throughput phenotyping, UPhenome. My objective is to derive,
implement, validate, and disseminate UPhenome-based models which will 1) process both biomedical knowledge
and clinical data to yield identifiable phenotypes and 2) model co-occurrence, temporal, and hierarchical
relationships among inferred phenotypes. My central hypothesis is that UPhenome-based models can
support large-scale clinical data analysis by inferring phenotypes that effectively represent the
clinical characteristics of specific diseases while also capturing common comorbidities
(co-occurrence model), patterns of disease progression (temporal model), and the organization of
diseases into types and subtypes (hierarchical model). To test this hypothesis, I propose the following aims.
Aim 1: I describe Guided UPhenome, a model which processes biomedical knowledge and clinical data to yield
identifiable phenotypes. The model's capacity for capturing disease-specific traits is evaluated qualitatively by
clinical experts, and quantitatively in disease-specific cohort selection tasks versus a gold standard and a
competing algorithm. Aim 2: I detail extensions to UPhenome which allow for modeling of disease relationships.
The meaningfulness of these relationships is evaluated qualitatively using a series of custom “intrusion tasks”
inspired by the topic modeling literature. Aim 3: I will disseminate UPhenome-based models by ensuring the...

## Key facts

- **NIH application ID:** 10181074
- **Project number:** 5F31LM012894-04
- **Recipient organization:** COLUMBIA UNIVERSITY HEALTH SCIENCES
- **Principal Investigator:** Victor Alfonso Rodriguez
- **Activity code:** F31
- **Funding institute:** NIH
- **Fiscal year:** 2021
- **Award amount:** $51,036
- **Award type:** 5
- **Project period:** 2018-07-01 → 2023-06-30

## Primary source

NIH RePORTER: https://reporter.nih.gov/project-details/10181074

## Citation

> US National Institutes of Health, RePORTER application 10181074, Probabilistic modeling of observational clinical data for high-throughput inference of disease phenotypes (5F31LM012894-04). Retrieved via AI Analytics 2026-05-23 from https://api.ai-analytics.org/grant/nih/10181074. Licensed CC0.

---

*[NIH grants dataset](/datasets/nih-grants) · CC0 1.0*