RI: Small: Characterizing the Meaning of Human Preferences for AI Alignment

NSF Award Search · 01002526DB NSF RESEARCH & RELATED ACTIVIT · $599,969

Abstract

Artificial Intelligence (AI) systems are becoming increasingly capable, yet deploying them to produce reliable, intended outcomes remains difficult. Large language models often fail to follow instructions, AI systems that govern important functions sometimes behave unpredictably, and autonomous systems, such as self-driving vehicles or robots, can act in ways that diverge from user expectations. To improve reliability and utility, future AI systems will need to demonstrate that their goals and behaviors consistently reflect and support the intentions of their human users. The dominant current paradigm for AI alignment relies on learning from human preferences over possible actions or outcomes of an AI system. However, such methods make a number of assumptions about how preferences should be interpreted and ignore many potential sources of error. The aim of this project is to improve the scientific characterization of human preferences in the context of AI alignment and leverage that knowledge to practically improve AI systems.

More specifically, Reinforcement Learning from Human Feedback (RLHF) is now at the core of many of the most successful contemporary approaches to AI alignment in applications ranging from robotics to language modeling. RLHF aims to align a policy with the desires implied by human preferences between pairs of trajectories, outcomes, or model outputs. However, such approaches typically rely on very strong assumptions about the meaning of human prefere
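As an illustration of the kind of assumption the abstract refers to, the sketch below shows one widely used interpretation of pairwise preferences in RLHF: a Bradley-Terry model in which the probability that a human prefers one trajectory over another depends only on the difference in their summed rewards. This is an assumption chosen for illustration, not a description of the project's own method; the function names (`partial_return`, `preference_probability`, `preference_loss`) and the reward lists are hypothetical.

```python
import math


def partial_return(rewards):
    """Sum of per-step rewards along one trajectory (hypothetical reward values)."""
    return sum(rewards)


def _sigmoid(x):
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def _softplus(x):
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def preference_probability(rewards_a, rewards_b):
    """P(A preferred over B) under the assumed Bradley-Terry model.

    Assumption: the human compares trajectories only by summed reward.
    """
    return _sigmoid(partial_return(rewards_a) - partial_return(rewards_b))


def preference_loss(rewards_a, rewards_b, human_prefers_a):
    """Negative log-likelihood of one human preference label."""
    diff = partial_return(rewards_a) - partial_return(rewards_b)
    # -log(sigmoid(diff)) == softplus(-diff); symmetric case for B preferred.
    return _softplus(-diff) if human_prefers_a else _softplus(diff)


if __name__ == "__main__":
    traj_a = [1.0, 0.5, 0.0]  # hypothetical per-step rewards
    traj_b = [0.0, 0.5, 0.0]
    print(preference_probability(traj_a, traj_b))
    print(preference_loss(traj_a, traj_b, human_prefers_a=True))
```

Under this model, two trajectories with equal summed reward are each preferred with probability 0.5, regardless of how they differ otherwise. Whether real human preferences behave this way is exactly the sort of question the abstract raises.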

Key facts

NSF award ID: 2437426
Awardee: University of Massachusetts Amherst (MA)
SAM.gov UEI: VGJHK59NMPK9
PI: Scott D Niekum
Primary program: 01002526DB NSF RESEARCH & RELATED ACTIVIT
All programs: ROBUST INTELLIGENCE, SMALL PROJECT
Estimated total: $599,969
Funds obligated: $599,969
Transaction type: Standard Grant
Period: 09/01/2025 → 08/31/2028