The promise of machine learning for novel approaches to archived developmental data

NIH RePORTER · NIH · R21 · $234,231

Abstract

The availability of large data sets from research studies via data repositories is believed to be critical for tackling key questions about “complex diseases” such as psychiatric disorders (Farber, 2017; Patel et al., 2022). Consequently, NIMH-supported investigations have recently been mandated to archive all their data. Notably, NIMH is currently funding the archiving of three consecutive grants we completed years ago, the data from which would otherwise be lost to the research community. These projects (plus a recently completed one) together constitute a longitudinal database on the course and outcome of young patients with research diagnoses of depressive disorders with onset in the juvenile years (data on biological siblings and controls are also available). Juvenile-onset depression (JOD) is a particularly malignant depression phenotype, with a worse overall clinical course and greater functional impairment than later-onset depression, and is still not fully understood. Our aim is to develop prototype machine learning (ML) algorithms (which can be customized as needed) to facilitate analyses of the longitudinal data being archived in the NIMH Data Archive (NDA). The data reflect repeated assessments from ages 7 to 14 years (at the start of study 1) to the late 20s to early 30s (at the end of study 4) on multiple domains of functioning and can yield actionable information about which risk and protective variables/domains best predict clinical and functional outcomes of JOD (e.g., depression recurrence, suicidal behavior, emotional competence). Because commonly used modeling approaches (which typically test a priori defined pathways) cannot accommodate the complexity of our data and key questions about JOD, we demonstrate the novel application of ML approaches. We propose that questions about JOD outcomes exemplify two scenarios.
Scenario (A) includes questions about well-established outcomes (e.g., depression recurrence) and a handful of well-known predictors, but with little information about the interrelationships among those predictors, particularly over the course of development. Scenario (B) reflects questions about less established outcomes (e.g., successful emotion regulation), the predictors of which are not well known or have only equivocal support. We will demonstrate how to accommodate such scenarios through two ML approaches: probabilistic graphical modeling and ensemble learning methods. We apply these modeling approaches within a developmental framework in a unique way to leverage the wealth of longitudinal information on multiple domains of functioning. To enable researchers to fully utilize the NDA-based (as well as similar) data, we will release the Python code packages we develop and the code for downloading and properly organizing the related data. Our approach may shift current analytic practices in developmental psychopathology research toward models that can optimize the use of such data, r...

Key facts

NIH application ID: 10949256
Project number: 1R21MH137601-01
Recipient: UNIVERSITY OF PITTSBURGH AT PITTSBURGH
Principal Investigator: MARIA KOVACS
Activity code: R21
Funding institute: NIH
Fiscal year: 2024
Award amount: $234,231
Award type: 1
Project period: 2024-08-01 → 2026-07-31