Knowledge-guided automated machine learning methods for modeling the interaction of HIV with addictive drugs

NIH RePORTER · NIH · R01 · $508,201

Abstract

The primary goal of this project is to bring powerful data mining and analytics methods, as well as computing technology and technical software, to the substance abuse and HIV research communities to enable everyone, regardless of expertise, to model multimodal data for the purpose of disease prediction with the end goal of clinical decision support. Our working hypothesis is that automated machine learning (AutoML) will accelerate the development of innovative strategies for translation of research findings to clinical use by enabling everyone to analyze biomedical data using data mining methods. This project builds on our user-friendly and open-source Tree-Based Pipeline Optimization Tool (TPOT) platform, which represents one of the first and most widely used open-source AutoML methods. A major benefit of this approach is that it makes machine learning accessible to novice users because it takes the guesswork and complexity out of picking, running, tuning, and optimizing machine learning algorithms and the various pre- and post-processing methods. Bringing this technology to the clinical and translational research communities will open the door to broad adoption of data mining methods for embracing the complexity of the relationship between multimodal substance abuse biomarkers and clinical outcomes such as HIV progression and severity. We propose here novel algorithms to adapt and extend TPOT for the large volumes of clinical data that are being collected on patients infected with HIV at Cedars-Sinai Medical Center in Los Angeles.
Specifically, we will first develop an ontology-based Addiction KnowledgeBase (AddictionKB) tailored to HIV endpoints and clinical data derived from electronic health records (EHRs) and their relationships with HIV infection and outcomes to inform the machine learning algorithms and assist with interpretation (AIM 1). We will then develop a large language model (AddictionLLM) using BLOOM to allow for natural language queries of AddictionKB to perform knowledge-guided feature selection (AIM 2). We will extend our TPOT AutoML (AddictionML) to include special operators that call the AddictionLLM algorithm for automated knowledge-guided feature selection within machine learning pipelines (AIM 3). We will apply AddictionKB, AddictionLLM, and AddictionML to the identification of substance abuse disorders and other clinical measures that are predictive of HIV progression and severity (AIM 4). Finally, we will distribute and support AddictionKB, AddictionLLM, and AddictionML as open-source software (AIM 5).
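The idea of knowledge-guided feature selection as a pipeline step can be illustrated with a minimal sketch. It is not the project's AddictionML operator or any TPOT API: the class name `KnowledgeGuidedSelector`, the relevance scores, the threshold, and the feature names are all hypothetical, and the scores stand in for whatever a knowledge base or language model query might return. The sketch only assumes the common fit/transform convention for pipeline steps and uses the Python standard library.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Sequence


@dataclass
class KnowledgeGuidedSelector:
    """Keep only features whose prior-knowledge score meets a threshold.

    Hypothetical illustration: the scores stand in for relevance values
    that a knowledge source might supply for each feature.
    """

    feature_names: Sequence[str]
    knowledge_scores: Dict[str, float]  # assumed relevance scores in [0, 1]
    threshold: float = 0.5

    def fit(self, X: List[List[float]], y: Optional[list] = None):
        # Record the column indices of features supported by prior knowledge.
        # Features without a score default to 0.0 and are dropped.
        self.selected_ = [
            i
            for i, name in enumerate(self.feature_names)
            if self.knowledge_scores.get(name, 0.0) >= self.threshold
        ]
        if not self.selected_:
            raise ValueError("no feature meets the knowledge threshold")
        return self

    def transform(self, X: List[List[float]]) -> List[List[float]]:
        # Project every row onto the selected columns, preserving order.
        return [[row[i] for i in self.selected_] for row in X]

    def selected_names(self) -> List[str]:
        return [self.feature_names[i] for i in self.selected_]


# Toy data with made-up feature names and scores, for illustration only.
features = ["cd4_count", "viral_load", "opioid_use", "zip_code"]
scores = {"cd4_count": 0.9, "viral_load": 0.95, "opioid_use": 0.7, "zip_code": 0.1}
X = [[350.0, 12000.0, 1.0, 90048.0], [620.0, 50.0, 0.0, 90210.0]]

selector = KnowledgeGuidedSelector(features, scores, threshold=0.5).fit(X)
X_selected = selector.transform(X)
```

In this sketch the selection depends only on prior knowledge, not on the training labels, which is what distinguishes it from purely data-driven selectors; a step of this shape could in principle sit ahead of any downstream model in an automated pipeline search.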

Key facts

NIH application ID: 10904298
Project number: 1R01LM014572-01
Recipient: CEDARS-SINAI MEDICAL CENTER
Principal Investigator: Jason H. Moore
Activity code: R01
Funding institute: NIH
Fiscal year: 2024
Award amount: $508,201
Award type: 1
Project period: 2024-07-01 → 2029-05-31