# Applying Large Language Models to Accelerate Abstraction of Cancer Pathology Reports for Cancer Registry (LLMs for Unstructured Data Extraction)

> **NIH P30** · H. LEE MOFFITT CANCER CTR & RES INST · 2023 · $299,999

## Abstract

Pathology reports, containing critical information on tissue samples and lesions, play a significant role in
determining cancer treatment selection, prognosis, risk stratification, and clinical trial screening. Yet, manually
extracting tumor characteristics from these unstructured or semi-structured reports is a complex, laborious
process. Recent advances in Natural Language Processing (NLP) via deep learning methodologies show
promising potential. Although Bidirectional Encoder Representations from Transformers (BERT) has achieved
notable results in various NLP tasks, its application in pathology is constrained by its limited allowable
input length. Our recent study addressed this by applying transfer learning to a BERT-based model on increasingly
complex knowledge sources, including Wikipedia, PubMed, MIMIC-III, and Moffitt institutional pathology
reports. This language model was further fine-tuned to identify site, histology, and associated ICD-O-3 codes
from pathology reports. Despite the promising preliminary results, our pilot work focuses on extractive
question-answering for a single primary solid tumor diagnosis, overlooking the rich terminology and variation of
pathology language.

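
The input-length constraint noted above is often handled by splitting a long report into overlapping windows. The abstract does not say which strategy the authors use, so the following is only an illustrative sketch of that general technique. Whitespace-separated tokens stand in for a real subword tokenizer, and `sliding_windows`, `max_len`, and `stride` are hypothetical names and values chosen for illustration (512 is the usual BERT token limit).

```python
from typing import List


def sliding_windows(
    tokens: List[str], max_len: int = 512, stride: int = 384
) -> List[List[str]]:
    """Split a token list into overlapping windows of at most max_len tokens.

    Consecutive windows start `stride` tokens apart, so they share
    (max_len - stride) tokens of context. Values here are assumptions.
    """
    if max_len <= 0 or not 0 < stride <= max_len:
        raise ValueError("need max_len > 0 and 0 < stride <= max_len")
    windows = []
    start = 0
    while True:
        windows.append(tokens[start:start + max_len])
        # Stop once the current window reaches the end of the report.
        if start + max_len >= len(tokens):
            break
        start += stride
    return windows


# Usage: a hypothetical report, tokenized naively on whitespace.
report_tokens = "Specimen: left breast, core biopsy. Diagnosis: invasive ductal carcinoma.".split()
chunks = sliding_windows(report_tokens, max_len=6, stride=4)
```

Each window could then be passed to the model separately and the per-window predictions merged; how to merge them is a design choice this sketch leaves open.
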
Our long-term goal is to employ Large Language Models (LLMs) to extract information from all types of clinical
notes, assisting institutional certified tumor registrars in data abstraction for the Cancer Registry. In this work,
we focus specifically on pathology reports and propose to train LLMs on 349,544 institutional pathology
reports to identify five key cancer data elements: primary site, histology, stage, grade, and laterality. The study
will focus on common (breast) and rare (gastric) cancers.

We will leverage existing LLMs pretrained on large public corpora, retrain them on institutional pathology
reports, and finally fine-tune them to predict specific cancer data elements.

We pursue two specific aims. Aim 1: predict breast cancer data elements by abstractive question-answering
using the existing caBERTnet architecture (Aim 1a), and by a prompt-based fine-tuning technique (Aim 1b). Aim 2:
utilize zero-shot inference (Aim 2a) and soft-prompt tuning (Aim 2b) on these fine-tuned models to predict
gastric cancer data elements. This proposal is innovative in using LLMs to identify key cancer data elements in
real-world settings, and it has broad impact by accelerating research, streamlining cancer registry operations,
and fostering the development of effective cancer prevention and treatment therapies.
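
As a rough illustration of the question-answering framing in the aims, the sketch below builds one prompt per data element for a single report. The five element names come from the abstract; the function name `build_prompts`, the prompt wording, and the example report text are hypothetical and are not taken from the proposal.

```python
from typing import Dict

# The five data elements named in the abstract.
DATA_ELEMENTS = ("primary site", "histology", "stage", "grade", "laterality")


def build_prompts(report_text: str) -> Dict[str, str]:
    """Return one question-answering prompt per data element.

    The template wording is an assumption made for illustration only.
    """
    report = report_text.strip()
    return {
        element: (
            f"Pathology report:\n{report}\n\n"
            f"Question: What is the {element} of the tumor described above?\n"
            "Answer:"
        )
        for element in DATA_ELEMENTS
    }


# Usage with a made-up report snippet.
prompts = build_prompts("Invasive ductal carcinoma, left breast.")
```

Keeping one prompt per element makes it straightforward to score each element separately, which may matter when comparing a common cancer against a rare one.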

## Key facts

- **NIH application ID:** 10890243
- **Project number:** 3P30CA076292-25S4
- **Recipient organization:** H. LEE MOFFITT CANCER CTR & RES INST
- **Principal Investigator:** John L. Cleveland
- **Activity code:** P30
- **Funding institute:** NIH
- **Fiscal year:** 2023
- **Award amount:** $299,999
- **Award type:** 3
- **Project period:** 1998-02-18 → 2027-01-31

## Primary source

NIH RePORTER: https://reporter.nih.gov/project-details/10890243

## Citation

> US National Institutes of Health, RePORTER application 10890243, Applying Large Language Models to Accelerate Abstraction of Cancer Pathology Reports for Cancer Registry (LLMs for Unstructured Data Extraction) (3P30CA076292-25S4). Retrieved via AI Analytics 2026-05-22 from https://api.ai-analytics.org/grant/nih/10890243. Licensed CC0.

---

*[NIH grants dataset](/datasets/nih-grants) · CC0 1.0*
