Sepsis is a major driver of broad-spectrum antibiotic use, in large part because best practice guidelines and quality measures recommend immediate administration of antimicrobial therapy broad enough to cover all likely pathogens for patients with suspected sepsis. This leads to many patients receiving antibiotics covering methicillin-resistant Staphylococcus aureus (MRSA) and multidrug-resistant (MDR) Gram-negative organisms, despite only a tiny fraction having evidence of these infections. Overly broad antibiotics contribute to antibiotic resistance and increase the risk of acute kidney injury, Clostridioides difficile infection, and potentially even mortality. Clear guidance on when to choose empiric narrow-spectrum therapy has been lacking, however, in large part because simple models predicting the risk of MDR pathogens have not shown high enough accuracy to enable clinicians to safely withhold broad-spectrum therapy. Phenotyping studies that have attempted to elucidate subtypes of sepsis by applying clustering methods to detailed electronic health record (EHR) data have shown promise in predicting responses to varying treatment strategies, but none so far have sought to inform the appropriate initial breadth of antibiotic therapy. Previous studies have also ignored key data available in clinical notes, including presenting symptoms and recent antibiotic and healthcare exposures, even though those data may be associated with the likelihood of MDR infection. My central hypothesis is that objective EHR data can be augmented with key information extracted from unstructured free-text notes using large language models (LLMs) to define phenotypes of suspected sepsis that predict the value of initial anti-MRSA and anti-MDR Gram-negative therapy.
I will investigate this in a large multihospital cohort through the following specific aims: (1) Apply LLMs to clinical notes to identify presenting syndromes, recent antibiotic exposures, and recent healthcare exposures in patients with suspected sepsis, and assess the marginal value of these free-text data vs structured data alone to predict antibiotic choice and appropriateness; (2) Quantify the impact of overly broad vs appropriate vs overly narrow antibiotics on patient outcomes in patients with culture-positive sepsis; and (3) Identify phenotypes of suspected sepsis for which the negative impact of overly broad early antibiotic coverage most likely outweighs therapeutic benefit. The candidate, Theodore Pak, MD, PhD, is an infectious diseases fellow at Massachusetts General Hospital and a physician-scientist with experience in software engineering, bioinformatics, and analysis of EHR data. Dr. Pak’s goals during the K08 period are to obtain advanced training in EHR data analysis; learn advanced causal inference and machine learning methods; gain experience in artificial intelligence and natural language processing of clinical text; and strengthen abilities in scientific communication and leadership...