# Rule-based machine learning to address heterogeneity in high-dimensional survival data

> **NIH F31** · UNIVERSITY OF PENNSYLVANIA · 2021 · $46,036

## Abstract

In the post-genomic era, researchers are met with an abundance of data to analyze and interpret. Genome-wide
association analyses (GWAS) often boast millions of single-nucleotide polymorphisms (SNPs), alongside
increasingly large epigenomic, transcriptomic, proteomic (multi-omic) and other data sets. While the current
standard in genetic epidemiology emphasizes increased sample sizes, we propose that substantial progress
can be made by developing improved methods to analyze the vast amount of multi-omic data that currently
exists. A number of methodological challenges including dimensionality and the multiple testing burden have
limited the success of many approaches thus far. Furthermore, only considering simple, linear associations
leaves out the more likely scenario of complex genetic and multi-omic relationships driving risk and outcomes
in common diseases. Heterogeneity is just one of the complex mechanisms that underlies disease risk and
outcomes, but is arguably among the most difficult to model and detect. This project tackles this and other
challenges in glioma, a highly heterogeneous cancer type. Improving upon available treatment strategies in
cancer and glioma specifically will undoubtedly require a full characterization of genetic heterogeneity and
epigenetic mechanisms. In addition to confronting the dimensionality of genetic and epigenetic data using a
feature selection strategy that can detect both main effects and interactions and preserve heterogeneity, we will
modify an existing method for detecting heterogeneity to accommodate censored survival data. First, in Aim 1,
we will use simulated genetic survival data to establish the utility of a Relief-based feature selection algorithm
in capturing complex genetic architectures (i.e., main effects, heterogeneity, and epistasis). We will compare it
against standard approaches for high-dimensional feature selection of survival data. Aim 2 updates a learning
classifier system (LCS), a type of rule-based machine learning that uses IF/THEN rules to model complex and
heterogeneous problem spaces. To our knowledge, no LCS that handles censored survival data has been
developed to date. After testing our survival LCS on simulated data and comparing it to standard survival
methods, in Aim 3 we will implement it using somatic mutation and methylation data from the TCGA glioma
dataset. Finally, as part of Aim 3, we will perform a pathway analysis using the LCS output in an effort to
identify common biological pathways underlying heterogeneous associations. We will also utilize a network
visualization tool to better understand interactions between features and provide a visual interpretation of the
results. Findings from this project will lay the foundation for precision care and treatment of glioma. Our
innovative approach to high-dimensional, heterogeneous survival data will be both generalizable and
interpretable, qualities that are missing from current machine learning approa...
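
To make the IF/THEN rule idea from the abstract concrete, here is a minimal Python sketch of how a single rule might select a subgroup from right-censored survival data and summarize that subgroup with a Kaplan-Meier estimate. It is an illustration only, not the project's survival LCS: the `Instance` and `Rule` classes, the feature names `mut_x` and `meth_y`, the toy records, and the choice of Kaplan-Meier as the subgroup summary are all assumptions made for this example.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass
class Instance:
    features: Dict[str, int]  # hypothetical binary features (1 = present, 0 = absent)
    time: float               # follow-up time
    event: bool               # True = event observed, False = right-censored


@dataclass
class Rule:
    # IF part: required feature values. Features not listed act as wildcards,
    # so different rules can cover different (heterogeneous) subgroups.
    condition: Dict[str, int]

    def matches(self, inst: Instance) -> bool:
        return all(inst.features.get(k) == v for k, v in self.condition.items())


def kaplan_meier(instances: List[Instance]) -> List[Tuple[float, float]]:
    """Return (time, estimated survival probability) at each distinct event time."""
    curve = []
    surv = 1.0
    event_times = sorted({i.time for i in instances if i.event})
    for t in event_times:
        # Censored instances still count as at risk until their censoring time.
        at_risk = sum(1 for i in instances if i.time >= t)
        events = sum(1 for i in instances if i.event and i.time == t)
        surv *= 1.0 - events / at_risk
        curve.append((t, surv))
    return curve


# Toy, made-up records (not real patient data).
data = [
    Instance({"mut_x": 1, "meth_y": 1}, 5.0, True),
    Instance({"mut_x": 1, "meth_y": 1}, 8.0, False),
    Instance({"mut_x": 1, "meth_y": 1}, 12.0, True),
    Instance({"mut_x": 0, "meth_y": 1}, 20.0, True),
    Instance({"mut_x": 1, "meth_y": 0}, 15.0, False),
]

# IF mut_x = 1 AND meth_y = 1 THEN summarize survival of the matching subgroup.
rule = Rule({"mut_x": 1, "meth_y": 1})
matched = [inst for inst in data if rule.matches(inst)]
curve = kaplan_meier(matched)

for t, s in curve:
    print(f"t={t:g}  S(t)={s:.3f}")
```

In this sketch the THEN part is a survival curve rather than a class label, which is one plausible way to attach a censoring-aware outcome to a rule; a full LCS would also need rule discovery, fitness evaluation, and rule-set compaction, none of which is shown here.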

## Key facts

- **NIH application ID:** 10141575
- **Project number:** 1F31LM013583-01
- **Recipient organization:** UNIVERSITY OF PENNSYLVANIA
- **Principal Investigator:** Alexa Abigail Woodward
- **Activity code:** F31
- **Funding institute:** NIH
- **Fiscal year:** 2021
- **Award amount:** $46,036
- **Award type:** 1
- **Project period:** 2021-09-01 → 2024-08-31

## Primary source

NIH RePORTER: https://reporter.nih.gov/project-details/10141575

## Citation

> US National Institutes of Health, RePORTER application 10141575, Rule-based machine learning to address heterogeneity in high-dimensional survival data (1F31LM013583-01). Retrieved via AI Analytics 2026-05-22 from https://api.ai-analytics.org/grant/nih/10141575. Licensed CC0.

---

*[NIH grants dataset](/datasets/nih-grants) · CC0 1.0*