Transfer Learning for Digital Curation of the EMR Clinical Narrative

NIH RePORTER · NIH · R01 · $376,125

Abstract

Project Summary: This proposal responds to PAR 18-796 and seeks support for advancing methodologies for a transfer learning framework for the digital curation of the Electronic Medical Records (EMR) clinical narrative. In the current era of increasing importance of Artificial Intelligence (AI) in biomedicine, our proposal tackles a critical AI component: automated annotation of health-related text. Since 2015, the development and application of machine learning (ML) methods have exploded, propelled by the convergence of plentiful digitized unstructured data (text, speech, images), hardware, and the refinement of neural networks, or deep learning. 2018 marked a turning point in Natural Language Processing (NLP), particularly for transfer learning through pre-trained models such as Universal Language Model Fine-tuning for Text Classification (ULMFiT), Allen AI's ELMo, and OpenAI's GPT. In November 2018, Google published the Bidirectional Encoder Representations from Transformers (BERT), a transformer-based model pre-trained on massive general text databases (3.3B words total). The publication reported using BERT representations to build classifiers for 11 NLP tasks, which outperformed the state of the art (SOTA) by large margins. The NLP research community jumped at the idea of exploring this new framework but quickly realized that building BERT-style models from scratch is affordable and feasible for only a few. Thus, research proceeded in the direction of using these gigantic models as resources for language representations. Scientific efforts focused on pre-trained models (e.g., BERT) either as a source of high-quality language features or as a starting point for fine-tuning on a specific task, i.e., using a model as a checkpoint and re-training it with much smaller amounts of task-specific data to produce predictions, typically by adding one fully connected layer on top of the representations and training for a few epochs.
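The feature-extraction variant of this recipe can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration and is not the project's method: `frozen_encoder` is a hypothetical stand-in for a pre-trained model (a fixed vocabulary count rather than BERT), the four sentences and their labels are invented toy data, and the single trainable layer is a plain logistic-regression head trained by stochastic gradient descent.

```python
import math

# Hypothetical stand-in for a pre-trained encoder: a fixed vocabulary whose
# "parameters" are never updated. A real system would use a model such as BERT.
VOCAB = ["fever", "cough", "denies", "no"]


def frozen_encoder(text):
    """Map text to a fixed-length feature vector (token counts over VOCAB)."""
    tokens = text.lower().split()
    return [float(tokens.count(word)) for word in VOCAB]


def predict(weights, bias, features):
    """One fully connected layer followed by a sigmoid."""
    z = bias + sum(w * x for w, x in zip(weights, features))
    return 1.0 / (1.0 + math.exp(-z))


def train_head(examples, epochs=50, lr=0.5):
    """Train only the head on task-specific data; the encoder stays frozen."""
    weights = [0.0] * len(VOCAB)
    bias = 0.0
    for _ in range(epochs):
        for text, label in examples:
            features = frozen_encoder(text)
            error = predict(weights, bias, features) - label
            # Gradient step for the logistic loss on a single example.
            weights = [w - lr * error * x for w, x in zip(weights, features)]
            bias -= lr * error
    return weights, bias


# Invented toy data: 1 = symptom asserted, 0 = symptom negated.
TOY_EXAMPLES = [
    ("patient reports fever and cough", 1),
    ("patient denies fever", 0),
    ("no cough today", 0),
    ("persistent cough noted", 1),
]

if __name__ == "__main__":
    weights, bias = train_head(TOY_EXAMPLES)
    for text, label in TOY_EXAMPLES:
        score = predict(weights, bias, frozen_encoder(text))
        print(f"{label} {score:.2f} {text}")
```

In full fine-tuning, the encoder's own parameters would also be updated during these passes; the sketch keeps them fixed so that the role of the added layer is easy to see.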
This watershed shift in NLP toward transfer learning, which parallels developments in computer vision a few years earlier, coupled with our latest work, brings to the forefront a critical NLP research topic ripe for exploration: a transfer learning framework for the digital curation of the EMR clinical narrative. The proposed work investigates novel scientific methods for extracting detailed information from health-related text, especially the EMR, the major source of phenotype data for patients. Precise phenotype information is needed to advance translational research, particularly to unravel the effects of genetic, epigenetic, and systems changes on responsiveness. This research is in line with the latest developments in deep learning approaches and AI in general, and is expected to enhance biomedical research and, through it, the health of the public.

Key facts

NIH application ID: 10468604
Project number: 5R01LM013486-02
Recipient: BOSTON CHILDREN'S HOSPITAL
Principal Investigator: GUERGANA K. SAVOVA
Activity code: R01
Funding institute: NIH
Fiscal year: 2022
Award amount: $376,125
Award type: 5
Project period: 2021-08-12 → 2025-05-31