PROJECT 1 ABSTRACT
Overcoming methodologic barriers to analysis of observational clinico-genomic data in oncology
Project Leaders: Kenneth Kehl (DFCI); Deborah Schrag (MSK)
Precision oncology, which seeks to identify biomarkers to guide treatment selection for individual patients, has been applied increasingly in cancer research and clinical care. Pursuing this objective requires access to large databases of tumors that have been both molecularly characterized and clinically annotated. However, the absence of scalable methods for gathering and analyzing the clinical endpoints necessary to pursue patient-relevant research questions has been a major barrier to constructing such datasets. Key cancer outcomes, including response to treatment, are generally not recorded in a structured format in “real-world” electronic health record (EHR) datasets. Extraction of such outcomes from EHRs has historically required resource-intensive manual medical records review, which has in turn suffered from the lack of a standardized data model for medical record annotation across studies. Real-world molecular testing and follow-up patterns, which may be correlated with endpoints of interest, constitute an additional challenge to clinico-genomic analysis. Methods to reliably extract clinically interpretable, reproducible endpoints from EHRs are necessary to advance precision oncology. The overarching objective of this proposal is to develop, refine, and test such methods at scale. Towards this end, we have developed the Pathology, Radiology/Imaging, Signs/Symptoms, Medical oncologist assessment, and bioMarkers (PRISSMM) data model for extracting structured, reproducible cancer outcomes. PRISSMM provides a rubric for abstraction of specific cancer outcomes from individual imaging reports and medical oncologist notes and can be used by investigators at any health care system, agnostic to EHR vendor.
These outcomes include the presence of cancer within specific EHR imaging reports and clinical notes, including assessments of tumor at specific body sites; progression/worsening; and response/improvement. Annotations of individual reports along the disease trajectory can then be analyzed to derive relevant endpoints, such as progression-free survival. Still, these “real-world” endpoints will only be useful if they (1) are acceptable to diverse stakeholders; (2) can be extracted at scale; and (3) can be analyzed using methods that facilitate unbiased inference. In this project, we will evaluate novel PRISSMM endpoints by measuring associations among PRISSMM outcomes, traditional RECIST endpoints, and overall survival; train and validate machine learning/“AI” models to extract endpoints at the scale of a large cross-institutional clinico-genomic dataset; and develop best practices for time-to-event analysis given informative cohort entry and follow-up patterns in clinico-genomic data. This project will advance methods for cancer outcome analysis based on real-world evide...