# Big Data Methods for Comprehensive Similarity based Risk Prediction

> **NIH R01** · COLUMBIA UNIVERSITY HEALTH SCIENCES · 2020 · $420,434

## Abstract

Project Summary
Electronic health records (EHR) provide a rich source of data about representative populations but have yet to be
fully utilized to enhance clinical decision-making. Conventional approaches in clinical decision-making start
with the identification of relevant biomarkers based on subject-matter knowledge, followed by detailed but
limited analysis using these biomarkers exclusively. As the current scientific literature indicates, many human
disorders share a complex etiological basis and exhibit correlated disease progression. Therefore, it is
desirable to use comprehensive patient data for patient similarity. This proposal focuses on deriving a
comprehensive and integrated score of patient similarity from complete patient characteristics currently
available, including but not limited to 1) demographic similarity; 2) genetic similarity; 3) clinical phenotype
similarity; 4) treatment similarity; and 5) exposome similarity (here the exposome is defined as all available attributes
of the living environment an individual is exposed to), where some of these aspects may overlap and interact. We
will optimize information fusion and task-dependent feature selection for assessing patient similarity for clinical
risk prediction. Because no pipeline currently exists that can extract executable complete
patient determinant data, to achieve the research goal described above we propose to first deliver an open-source
data preparation pipeline based on a widely used clinical data standard, the OMOP
(Observational Medical Outcomes Partnership) Common Data Model (CDM) version 5.2. Moreover, to mitigate
common missingness and sparsity challenges in clinical data, we describe the first attempt to represent
patients' sparse clinical information with missingness, including diagnosis information, medication data, and
treatment interventions, with a fixed-length feature vector (i.e., the Patient2Vec). This project has four specific
aims. Aim 1 is to develop a clinical data processing pipeline for harmonizing patient information from multiple
sources into a standards-based uniform data representation and to evaluate its efficiency, interoperability,
and accuracy. Aim 2 is to leverage a powerful machine learning technique, Document2Vec, from the natural
language processing literature, to create an open-source Patient2Vec framework for the derivation of
informative numerical representations of patients. Aim 3 is to develop a unified machine learning
clinical-outcome-prediction framework for Optimized Patient Similarity Fusion (OptPSF) that integrates traditional
medical covariates with the derived numerical patient representations from Patient2Vec (Aim 2) for improved
clinical risk prediction. Aim 4 is to evaluate our similarity framework for predicting 1) the risk of end-stage
kidney disease (ESKD) in the general EHR patient population and 2) the risk of death among patients with chronic
kidney disease (CKD).
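
To make the idea of fusing several similarity domains into one score more concrete, the sketch below combines per-domain similarities with a weighted average. It is only an illustration under stated assumptions and is not the OptPSF method proposed in the abstract: the use of Jaccard similarity, the fixed weights, the function names `jaccard` and `fused_similarity`, and the two example patients are all hypothetical.

```python
from typing import Dict, Set

# Hypothetical patient record: one set of codes per similarity domain.
Patient = Dict[str, Set[str]]


def jaccard(a: Set[str], b: Set[str]) -> float:
    """Jaccard similarity of two code sets; defined as 0.0 when both are empty."""
    union = a | b
    if not union:
        return 0.0
    return len(a & b) / len(union)


def fused_similarity(p: Patient, q: Patient, weights: Dict[str, float]) -> float:
    """Weighted average of per-domain Jaccard similarities (an assumed fusion rule)."""
    total = sum(weights.values())
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    score = 0.0
    for domain, weight in weights.items():
        # A domain missing from a record is treated as an empty code set.
        score += weight * jaccard(p.get(domain, set()), q.get(domain, set()))
    return score / total


# Made-up example records, for illustration only.
patient_a: Patient = {
    "diagnosis": {"N18.3", "I10"},
    "medication": {"lisinopril"},
}
patient_b: Patient = {
    "diagnosis": {"N18.3", "E11.9"},
    "medication": {"lisinopril", "metformin"},
}
weights = {"diagnosis": 0.5, "medication": 0.5}

print(fused_similarity(patient_a, patient_b, weights))
```

In a task-dependent setting such as the one the abstract describes, the weights would presumably be learned against a clinical outcome rather than fixed by hand; the equal weights here are placeholders.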

## Key facts

- **NIH application ID:** 9870948
- **Project number:** 5R01LM013061-02
- **Recipient organization:** COLUMBIA UNIVERSITY HEALTH SCIENCES
- **Principal Investigator:** Krzysztof Kiryluk
- **Activity code:** R01
- **Funding institute:** NIH
- **Fiscal year:** 2020
- **Award amount:** $420,434
- **Award type:** 5
- **Project period:** 2019-02-12 → 2024-01-31

## Primary source

NIH RePORTER: https://reporter.nih.gov/project-details/9870948

## Citation

> US National Institutes of Health, RePORTER application 9870948, Big Data Methods for Comprehensive Similarity based Risk Prediction (5R01LM013061-02). Retrieved via AI Analytics 2026-05-22 from https://api.ai-analytics.org/grant/nih/9870948. Licensed CC0.

---

*[NIH grants dataset](/datasets/nih-grants) · CC0 1.0*