Abstract: Major depressive disorder contributes substantially to morbidity, mortality, and health care cost. Standard treatments are ineffective for up to a third of patients, so new treatment options are needed, along with strategies to make more effective use of existing treatments. However, progress in expanding therapeutic options has been hindered by heterogeneity in the clinical presentation and course of depression. In other disorders such as inflammatory bowel disease, cancer, and dementia, identifying disease subtypes has led to therapeutic discoveries. In major depressive disorder, efforts to identify subtypes based on clinical observation have had limited success, primarily because adequate cohorts for replication have not been available and because the features most apparent to clinicians may not be the most relevant for differentiating subgroups. Efforts to leverage large electronic health record data sets for subtyping address some of these challenges, but standard approaches may not yield features that are human-interpretable or that have value in prediction. The investigators have developed methods for engineering features that balance utility in prediction with interpretability. Preliminary work by the investigators during a year of R56 support, which yielded 4 publications, demonstrates that this approach yields coherent topics without sacrificing predictive validity; electronic health records contain meaningful data that facilitate identification of interpretable patient subgroups. The present study draws on very large cohorts of individuals with major depression, defined by a validated algorithm, in electronic health records from two health systems. It will first apply methods developed by the investigators to identify subtypes of major depressive disorder (MDD). These subtypes will then be examined in terms of predictive validity as well as interpretability by clinicians.
The study builds on a productive collaboration among a team of investigators experienced in mood disorder phenotyping and clinical investigation, in the analysis of large-scale longitudinal electronic health records, and in the development and application of innovative machine learning methods that yield interpretable models rather than black boxes. Data-driven disease subtyping will facilitate clinically useful risk stratification as well as biological study of mood disorders.