# Structure-Function-Aware Large Protein Language Models for Enhanced Biomedical Applications

> **NIH R01** · UNIVERSITY OF KENTUCKY · 2024 · $343,124

## Abstract

Large protein language models have shown their foundational role in biomedical research. However, two
challenges block their broad application: (a) the absence of critical knowledge about protein
structure and functions in the models, and (b) the lack of efficient approaches to adapt a trained protein
language model. To address the two challenges, we propose to develop protein language models with
knowledge of protein structures and functions, and adaptation methods that can provide accurate predictions
for protein properties. The goal is to develop and validate structure-function-aware large protein
language models (SF-PLM) that can be adapted to perform challenging biomedical research tasks
using few-shot learning. We hypothesize that (a) multi-view contrastive learning can fuse 2D/3D structural
information into 1D representation, (b) well-developed reinforcement learning can align a large protein language
model with the related function annotation, and (c) prompt tuning can realize a few-shot learning process to
adapt the trained models to specific biomedical tasks. Guided by these hypotheses, we propose three Specific
Aims to achieve the proposal's goal. Aim 1: Develop large protein language models aware of 2D and 3D
structures using multi-view contrastive learning. We will develop encoders for the protein 1D, 2D, and 3D
structures; optimize the model training procedure and contrastive loss functions; and validate and select the
developed models using structure-oriented downstream tasks. Aim 2: Develop a reinforcement learning-based
method to align knowledge of protein functions with the structure-aware large protein language models.
We will start by developing an initial policy model, then develop the reward model and proximal policy
optimization to align the trained large protein language models, and finally validate and select the aligned large
protein language models. Aim 3: Develop prompt technologies and tools to adapt structure-function-aware
large protein language models for downstream tasks. We will develop prompt tuning to adapt the trained
protein language models for antimicrobial peptide design and for predicting the targets and phosphorylation
strengths of polo-like kinase 1 (PLK1), a kinase overexpressed in cancer cells. We will build utilities to
enable community use of prompt tuning. The success of the proposed research will lead to (a) the
development of novel large protein language models aware of structures and functions, (b) prompt-based
efficient adaptation of trained large protein language models for downstream tasks, (c) several novel
antimicrobial peptides, (d) a list of predicted PLK1 substrates and their phosphorylation strengths, and (e)
a library of Python code that enables the development of the pre-trained protein language models and
efficient prompt tuning. These outcomes will provide and validate fundamental deep learning tools for
biomedical research. Outcomes (c) and (d) will ...
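
Hypothesis (a) relies on contrastive learning to pull a protein's sequence (1D) embedding toward the embedding of its own structure and away from the structures of other proteins. The abstract does not say which loss the project uses, so the sketch below assumes a standard InfoNCE-style objective, computed in one direction only (sequence to structure). The function name `info_nce`, the temperature value, and the toy two-dimensional embeddings are all illustrative assumptions and are not taken from the proposal.

```python
import math


def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def normalize(v):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(dot(v, v))
    return [a / norm for a in v]


def info_nce(seq_embs, struct_embs, temperature=0.1):
    """Mean InfoNCE loss where (seq_embs[i], struct_embs[i]) are positive pairs.

    Every other structure embedding in the batch serves as a negative.
    This is an assumed loss for illustration, not the project's actual one.
    """
    seq = [normalize(v) for v in seq_embs]
    struct = [normalize(v) for v in struct_embs]
    total = 0.0
    for i, s in enumerate(seq):
        logits = [dot(s, t) / temperature for t in struct]
        # Numerically stable log-sum-exp over all candidates in the batch.
        m = max(logits)
        log_denom = m + math.log(sum(math.exp(x - m) for x in logits))
        # Negative log-probability of picking the matching structure.
        total += log_denom - logits[i]
    return total / len(seq)


# Toy, made-up embeddings for two hypothetical proteins.
aligned_seq = [[1.0, 0.0], [0.0, 1.0]]
aligned_struct = [[0.9, 0.1], [0.1, 0.9]]   # each structure matches its sequence
swapped_struct = [[0.1, 0.9], [0.9, 0.1]]   # pairs deliberately mismatched

loss_aligned = info_nce(aligned_seq, aligned_struct)
loss_swapped = info_nce(aligned_seq, swapped_struct)
```

With these toy inputs, `loss_aligned` comes out lower than `loss_swapped`, which is the behavior a contrastive objective is meant to reward. A multi-view setup as described in Aim 1 would presumably apply a loss of this kind across the 1D, 2D, and 3D encoders, but that pairing scheme is an assumption here.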

## Key facts

- **NIH application ID:** 10859349
- **Project number:** 1R01LM014510-01
- **Recipient organization:** UNIVERSITY OF KENTUCKY
- **Principal Investigator:** Qing Shao
- **Activity code:** R01
- **Funding institute:** NIH
- **Fiscal year:** 2024
- **Award amount:** $343,124
- **Award type:** 1
- **Project period:** 2024-06-21 → 2028-05-31

## Primary source

NIH RePORTER: https://reporter.nih.gov/project-details/10859349

## Citation

> US National Institutes of Health, RePORTER application 10859349, Structure-Function-Aware Large Protein Language Models for Enhanced Biomedical Applications (1R01LM014510-01). Retrieved via AI Analytics 2026-05-27 from https://api.ai-analytics.org/grant/nih/10859349. Licensed CC0.

---

*[NIH grants dataset](/datasets/nih-grants) · CC0 1.0*
