# Identifying Sepsis Phenotypes Associated with Antibiotic-Resistant Pathogens Using Large Language Models and Machine Learning

> **NIH AHRQ K08** · MASSACHUSETTS GENERAL HOSPITAL · 2024 · $147,122

## Abstract

Sepsis is a major driver of broad-spectrum antibiotic use, in large part because best practice guidelines and
quality measures recommend immediate administration of antimicrobial therapy broad enough to cover all
likely pathogens for patients with suspected sepsis. This leads to many patients receiving antibiotics covering
methicillin-resistant Staphylococcus aureus (MRSA) and multidrug-resistant (MDR) Gram-negative organisms,
despite only a tiny fraction having evidence of these infections. Overly broad antibiotics contribute to antibiotic
resistance and increase the risk of acute kidney injury, Clostridioides difficile infection, and potentially even
mortality. Clear guidance on when to choose empiric narrow-spectrum therapy has been lacking, however, in
large part because simple models predicting the risk of MDR pathogens have not shown high enough accuracy
to enable clinicians to safely withhold broad-spectrum therapy. Phenotyping studies that have attempted to
elucidate subtypes of sepsis by applying clustering methods to detailed electronic health record (EHR) data
have shown promise in predicting responses to varying treatment strategies, but none so far have sought to
inform the appropriate initial breadth of antibiotic therapy. Previous studies have also ignored key data
available in clinical notes, including presenting symptoms and recent antibiotic and healthcare exposures, even
though those data may be associated with the likelihood of MDR infection. My central hypothesis is that
objective EHR data can be augmented with key information extracted from unstructured free-text notes using
large language models (LLMs) to define phenotypes of suspected sepsis that predict the value of initial
anti-MRSA and anti-MDR Gram-negative therapy. I will investigate this in a large multihospital cohort through the
following specific aims: (1) Apply LLMs to clinical notes to identify presenting syndromes, recent antibiotic
exposures, and recent healthcare exposures in patients with suspected sepsis, and assess the marginal value
of these free-text data vs structured data alone to predict antibiotic choice and appropriateness; (2) Quantify
the impact of overly broad vs appropriate vs overly narrow antibiotics on patient outcomes in patients with
culture-positive sepsis; and (3) Identify phenotypes of suspected sepsis for whom the negative impact of overly
broad early antibiotic coverage most likely outweighs therapeutic benefit. The candidate, Dr. Theodore Pak,
MD, PhD, is an infectious diseases fellow at Massachusetts General Hospital and a physician-scientist with
experience in software engineering, bioinformatics, and analysis of EHR data. Dr. Pak’s goals during the K08
period are to obtain advanced training in EHR data analysis; learn advanced causal inference and machine
learning methods; gain experience in artificial intelligence and natural language processing of clinical text; and
strengthen abilities in scientific communication and leadership...
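
Aim 1 compares a model that uses structured data alone with one that also uses features extracted from free-text notes. The abstract does not say how that marginal value will be measured. The sketch below assumes one common choice, the difference in area under the ROC curve (AUROC) between the two models' predicted risks. The function names and the toy labels and scores are hypothetical and synthetic; they are not study data or the investigator's method.

```python
from typing import Sequence


def auroc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """AUROC via pairwise comparison (the Mann-Whitney U formulation).

    Each (positive, negative) pair counts as 1 if the positive case has the
    higher score and 0.5 if the scores are tied.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative label")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def marginal_auroc(
    labels: Sequence[int],
    structured_scores: Sequence[float],
    augmented_scores: Sequence[float],
) -> float:
    """AUROC gain of the augmented model over the structured-only model."""
    return auroc(labels, augmented_scores) - auroc(labels, structured_scores)


# Synthetic toy values for illustration only (assumption, not study data).
# 1 = MDR pathogen recovered, 0 = not recovered.
labels = [1, 0, 1, 0, 0, 1]
structured = [0.6, 0.5, 0.4, 0.3, 0.7, 0.8]  # structured EHR features only
augmented = [0.7, 0.4, 0.6, 0.2, 0.5, 0.9]   # plus note-derived features

gain = marginal_auroc(labels, structured, augmented)
print(f"AUROC gain from note-derived features: {gain:.3f}")
```

In practice both models would be scored on held-out patients, and the gain would be reported with a confidence interval rather than as a single point estimate.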

## Key facts

- **NIH application ID:** 10948425
- **Project number:** 1K08HS030118-01
- **Recipient organization:** MASSACHUSETTS GENERAL HOSPITAL
- **Principal Investigator:** Theodore Robertson Pak
- **Activity code:** K08
- **Funding institute:** AHRQ
- **Fiscal year:** 2024
- **Award amount:** $147,122
- **Award type:** 1
- **Project period:** 2024-08-01 → 2029-07-31

## Primary source

NIH RePORTER: https://reporter.nih.gov/project-details/10948425

## Citation

> US National Institutes of Health, RePORTER application 10948425, Identifying Sepsis Phenotypes Associated with Antibiotic-Resistant Pathogens Using Large Language Models and Machine Learning (1K08HS030118-01). Retrieved via AI Analytics 2026-05-23 from https://api.ai-analytics.org/grant/nih/10948425. Licensed CC0.

---

*[NIH grants dataset](/datasets/nih-grants) · CC0 1.0*