Rule-based machine learning to address heterogeneity in high-dimensional survival data

NIH RePORTER · NIH · F31 · $46,036

Abstract

Project Summary: In the post-genomic era, researchers are met with an abundance of data to analyze and interpret. Genome-wide association analyses (GWAS) often boast millions of single-nucleotide polymorphisms (SNPs), alongside increasingly large epigenomic, transcriptomic, proteomic (multi-omic) and other data sets. While the current standard in genetic epidemiology emphasizes increased sample sizes, we propose that substantial progress can be made by developing improved methods to analyze the vast amount of multi-omic data that currently exists. A number of methodological challenges, including dimensionality and the multiple testing burden, have limited the success of many approaches thus far. Furthermore, only considering simple, linear associations leaves out the more likely scenario of complex genetic and multi-omic relationships driving risk and outcomes in common diseases. Heterogeneity is just one of the complex mechanisms that underlie disease risk and outcomes, but it is arguably among the most difficult to model and detect. This project tackles this and other challenges in glioma, a highly heterogeneous cancer type. Improving upon available treatment strategies in cancer, and glioma specifically, will undoubtedly require a full characterization of genetic heterogeneity and epigenetic mechanisms. In addition to confronting the dimensionality of genetic and epigenetic data using a feature selection strategy that can detect both main effects and interactions while preserving heterogeneity, we will modify an existing method for detecting heterogeneity to accommodate censored survival data. First, in Aim 1, we will use simulated genetic survival data to establish the utility of a Relief-based feature selection algorithm in capturing complex genetic architectures (i.e., main effects, heterogeneity, and epistasis). We will compare it against standard approaches for high-dimensional feature selection of survival data.
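For readers unfamiliar with Relief-based feature selection, the following is a minimal sketch of the classic Relief weighting scheme for binary labels. It is not the project's survival-adapted algorithm, which the abstract does not specify; the function name `relief_scores`, the Manhattan distance, and the assumption that features are already scaled to [0, 1] are all illustrative choices.

```python
def relief_scores(X, y):
    """Classic Relief weights for binary labels (illustrative sketch only).

    X: list of feature vectors with values assumed scaled to [0, 1].
    y: list of binary class labels, one per row of X.
    Returns one weight per feature; larger means more relevant.
    """
    n, p = len(X), len(X[0])
    weights = [0.0] * p

    def dist(a, b):
        # Manhattan distance; an assumption, other metrics are possible.
        return sum(abs(u - v) for u, v in zip(a, b))

    used = 0
    for i in range(n):
        hits = [j for j in range(n) if j != i and y[j] == y[i]]
        misses = [j for j in range(n) if y[j] != y[i]]
        if not hits or not misses:
            continue  # no neighbor of one kind, so skip this instance
        hit = min(hits, key=lambda j: dist(X[i], X[j]))
        miss = min(misses, key=lambda j: dist(X[i], X[j]))
        for f in range(p):
            # Reward features that differ on the nearest miss,
            # penalize features that differ on the nearest hit.
            weights[f] += abs(X[i][f] - X[miss][f]) - abs(X[i][f] - X[hit][f])
        used += 1
    return [w / used for w in weights] if used else weights


# Toy data: feature 0 determines the label, feature 1 is noise.
toy_X = [[0, 0], [0, 1], [1, 0], [1, 1]]
toy_y = [0, 0, 1, 1]
print(relief_scores(toy_X, toy_y))
```

Because each instance is compared only with its nearest neighbors, the weights are computed locally, which is the property that lets Relief-style methods pick up interactions and heterogeneous effects that univariate filters tend to miss.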
Aim 2 updates a learning classifier system (LCS), a type of rule-based machine learning that uses IF/THEN rules to model complex and heterogeneous problem spaces. To our knowledge, no LCS that handles censored survival data has been developed to date. After testing our survival LCS on simulated data and comparing it to standard survival methods, in Aim 3 we will implement it using somatic mutation and methylation data from the TCGA glioma dataset. Finally, as part of Aim 3, we will perform a pathway analysis using the LCS output in an effort to identify common biological pathways underlying heterogeneous associations. We will also utilize a network visualization tool to better understand interactions between features and provide a visual interpretation of the results. Findings from this project will lay the foundation for precision care and treatment of glioma. Our innovative approach to high-dimensional, heterogeneous survival data will be both generalizable and interpretable, qualities that are missing from current machine learning approa...
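To make the idea of an IF/THEN rule over censored survival data concrete, here is a small hypothetical sketch. It is not the survival LCS proposed in Aim 2, whose design is not described in this record. It assumes a rule's IF part is a set of required feature values, and that its THEN part could be summarized by a Kaplan-Meier survival curve over the records the rule matches. The names `Rule`, `kaplan_meier`, `rule_prediction`, and the feature `mut_A` are invented for illustration.

```python
from dataclasses import dataclass


@dataclass
class Rule:
    # IF part: feature name -> required value. Features not listed are
    # treated as "don't care" (a hypothetical encoding).
    condition: dict

    def matches(self, record):
        return all(record.get(k) == v for k, v in self.condition.items())


def kaplan_meier(times, events):
    """Return [(t, S(t))] at each distinct observed event time.

    events[i] is 1 if the event was observed, 0 if the record is censored.
    """
    pairs = sorted(zip(times, events))
    at_risk = len(pairs)
    surv = 1.0
    curve = []
    i = 0
    while i < len(pairs):
        t = pairs[i][0]
        deaths = removed = 0
        # Group all records sharing the same time point.
        while i < len(pairs) and pairs[i][0] == t:
            deaths += pairs[i][1]
            removed += 1
            i += 1
        if deaths:
            surv *= 1 - deaths / at_risk
            curve.append((t, surv))
        # Censored records leave the risk set without changing S(t).
        at_risk -= removed
    return curve


def rule_prediction(rule, records):
    """THEN part (assumed): survival curve of the records the rule matches."""
    matched = [r for r in records if rule.matches(r)]
    return kaplan_meier([r["time"] for r in matched],
                        [r["event"] for r in matched])


records = [
    {"mut_A": 1, "time": 1, "event": 1},
    {"mut_A": 1, "time": 2, "event": 0},  # censored
    {"mut_A": 0, "time": 5, "event": 1},
    {"mut_A": 1, "time": 3, "event": 1},
    {"mut_A": 1, "time": 4, "event": 1},
]
print(rule_prediction(Rule({"mut_A": 1}), records))
```

Each rule covers only the subgroup it matches, so different rules can describe different subpopulations, which is one way a rule-based model can represent heterogeneity while remaining readable.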

Key facts

NIH application ID: 10141575
Project number: 1F31LM013583-01
Recipient: UNIVERSITY OF PENNSYLVANIA
Principal Investigator: Alexa Abigail Woodward
Activity code: F31
Funding institute: NIH
Fiscal year: 2021
Award amount: $46,036
Award type: 1
Project period: 2021-09-01 → 2024-08-31