Collaborative Research: RI: Small: Evaluation Concepts for Assessing and Improving Large Language Models

NSF Award Search · 01002526DB NSF RESEARCH & RELATED ACTIVIT · $300,000

Abstract

Systems based on modern large language models (LLMs) play an increasing role in how users access information and compose text. For instance, a user running a web search will increasingly rely on LLMs to summarize the search results rather than viewing individual web pages, and might use LLMs to “talk to” long documents like financial reports rather than reading them in their entirety. To support these new paradigms, it is important that an LLM be able to generate responses that are factual, informative, and safe. However, satisfying these criteria is not sufficient: a response should also be at the right level of abstraction or detail, in the right format, creative where appropriate, and aligned with other user needs. Current practice has neglected evaluation of these more subtle factors. This project proposes to address these shortcomings by identifying a set of “evaluation concepts” that indicate the kinds of areas where LLMs are failing, like “lack of detail in a list.” The project will then develop technology for automatically evaluating and improving LLM responses according to these concepts.

This project aims to improve the evaluation and the functionality of LLMs in two ways. First, the project will discover a concept taxonomy and learn how to evaluate LLM responses according to the concepts in that taxonomy. This process will necessitate advances in reward models, which are themselves LLMs customized to reliably score responses. Second, these reward models

Key facts

NSF award ID: 2433072
Awardee: Cornell University (NY)
SAM.gov UEI: G56PUALJ3KT5
PI: Tanya Goyal
Primary program: 01002526DB NSF RESEARCH & RELATED ACTIVIT
All programs: ROBUST INTELLIGENCE, SMALL PROJECT
Estimated total: $300,000
Funds obligated: $300,000
Transaction type: Standard Grant
Period: 07/01/2025 → 06/30/2028