ABSTRACT Delirium, or acute confusional state, affects 30-40% of hospitalized older adults, with the added cost of care estimated to be up to $7 billion. Although originally conceptualized as a transient disorder, delirium is now recognized to have significant consequences, including increased risk of death, functional decline, and long-term cognitive impairment. As up to 75% of cases are not recognized by providers, there is a critical need for advanced methods to identify delirium for clinical and research purposes, and to stratify patients based on delirium risk. Unfortunately, surveillance of delirium is time-consuming, so few institutions implement systematic screening procedures for all older adults. Research studies that use regular delirium screening assessments are typically small, single-institution studies in a subset of patients, and they do not release their data for replication due to concerns over privacy, lack of technical expertise, or other reasons. Epidemiological studies that do not use delirium screening assessments rely on proxy measures, such as administrative codes with sensitivity as low as 3%, instead of chart review, which can recover up to 74% of delirium cases. Advanced methods such as natural language processing (NLP) and machine learning (ML) have the potential to automate this chart review process and facilitate large-scale studies of delirium, but are hampered by a lack of suitable data for algorithm development. We propose to leverage systematic delirium screening available through the University of Alabama at Birmingham (UAB) Virtual Acute Care for Elders (ACE) quality improvement program to create and release a de-identified delirium dataset to address the data and diagnosis gap in epidemiological studies of delirium. Our Virtual ACE program has determined delirium status on more than 33,000 patients across a six-year period, providing a rich set of data from which this project will draw.
We will test the hypothesis that our transfer-learning-based de-identification method can help annotators de-identify clinical text more rapidly, opening the door to larger, faster, and more widely available dataset releases. To validate the utility of the full dataset, we will determine the statistical power of our de-identified corpus to detect differences between participants with and without delirium in commonly used ML study designs. Our delirium dataset release, containing 3,000 de-identified clinical notes and associated structured data, will be one of the largest text corpora ever released and the only text-inclusive corpus specifically for the study of delirium. The proposed dataset will be available for download on PhysioNet under a Data Use Agreement (DUA) to facilitate further development of NLP and ML approaches for determining delirium status, risk factors, and sequelae at other institutions and in other populations by transfer learning. Release of our de-identification algorithms and methodology will also faci...