Applying Large Language Models to Accelerate Abstraction of Cancer Pathology Reports for Cancer Registry (LLMs for Unstructured Data Extraction)

NIH RePORTER · NIH · P30 · $299,999

Abstract

Pathology reports contain critical information on tissue samples and lesions and play a significant role in cancer treatment selection, prognosis, risk stratification, and clinical trial screening. Yet manually extracting tumor characteristics from these unstructured or semi-structured reports is a complex, laborious process. Recent advances in Natural Language Processing (NLP) based on deep learning show promise. Although Bidirectional Encoder Representations from Transformers (BERT) has achieved notable results in various NLP tasks, its application in pathology is constrained by its limited allowable input length. Our recent study addressed this by applying transfer learning to a BERT-based model on increasingly complex knowledge sources, including Wikipedia, PubMed, MIMIC-III, and Moffitt institutional pathology reports. This language model was further fine-tuned to identify site, histology, and associated ICD-O-3 codes from pathology reports. Despite the promising preliminary results, our pilot work focuses on extractive question answering for a single primary solid tumor diagnosis, overlooking the rich terminology and variation of pathology language. Our long-term goal is to employ Large Language Models (LLMs) to extract information from all types of clinical notes, assisting institutional certified tumor registrars in data abstraction for the Cancer Registry. In this work, we focus specifically on pathology reports and propose to train LLMs on 349,544 institutional pathology reports to identify five key cancer data elements: primary site, histology, stage, grade, and laterality. The study will focus on a common cancer (breast) and a rare cancer (gastric). We will leverage existing LLMs pretrained on large public corpora, retrain them on institutional pathology reports, and finally fine-tune them to predict specific cancer data elements. We pursue two specific aims.
Aim 1: predict breast cancer data elements by abstractive question answering using the existing cabernet architecture (Aim 1a), and by a prompt-based fine-tuning technique (Aim 1b). Aim 2: utilize zero-shot inference (Aim 2a) and soft-prompt tuning (Aim 2b) on these fine-tuned models to predict gastric cancer data elements. This proposal is innovative in using LLMs to identify key cancer data elements in real-world settings, and it has broad impact by accelerating research, streamlining cancer registry operations, and fostering the development of effective cancer prevention and treatment therapies.
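
As an illustration only, the question-answering framing described above could be set up roughly as follows: one question-style prompt is built per data element for a given report. This is a minimal sketch, not the project's method. The prompt template, the `QAExample` class, and the `build_prompts` function are hypothetical names introduced here, and no model is called.

```python
from dataclasses import dataclass

# The five data elements named in the abstract.
DATA_ELEMENTS = ("primary site", "histology", "stage", "grade", "laterality")


@dataclass
class QAExample:
    element: str  # which data element the question targets
    prompt: str   # text that would be passed to a language model


def build_prompts(report_text: str) -> list[QAExample]:
    """Build one question-style prompt per data element.

    The template below is an assumption made for illustration; the
    abstract does not specify the wording of any prompt.
    """
    examples = []
    for element in DATA_ELEMENTS:
        prompt = (
            f"Pathology report:\n{report_text.strip()}\n\n"
            f"Question: What is the {element} of the tumor?\n"
            "Answer:"
        )
        examples.append(QAExample(element=element, prompt=prompt))
    return examples
```

Each prompt would then be given to a fine-tuned model, whose generated answer is taken as the predicted value for that element. Keeping one prompt per element means the same template can be reused for a different cancer type, as in the zero-shot setting of Aim 2a.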

Key facts

NIH application ID: 10890243
Project number: 3P30CA076292-25S4
Recipient: H. LEE MOFFITT CANCER CTR & RES INST
Principal Investigator: John L. Cleveland
Activity code: P30
Funding institute: NIH
Fiscal year: 2023
Award amount: $299,999
Award type: 3
Project period: 1998-02-18 → 2027-01-31