CAREER: Learning to See, Reason, and Act: Language-Grounded Video Intelligence for Perceptual AI Agents

Abstract

Modern artificial intelligence (AI) systems can recognize simple actions in short video clips, yet they cannot understand the full complexity of events unfolding over time. They struggle to follow multi-step activities, explain why something happened, imagine what would happen under different circumstances, or use what they observe to guide physical actions. These limitations prevent AI from having a transformative impact in areas of national importance such as robotics, healthcare, and manufacturing, where understanding dynamic visual scenes is essential. This project will develop a new generation of AI systems that can watch videos of the real world, understand what is happening at multiple levels of detail, reason about causes and consequences, and translate that understanding into purposeful action. For example, a robot watching a person assemble furniture could learn the sequence of steps involved, reason about why a particular step failed, and adapt its own plan accordingly. The educational activities integrated into this project will develop new university courses, mentor undergraduate and high school students, and engage the broader public through outreach programs connecting AI to accessible topics such as sports and skill learning. All software, trained models, and datasets produced by this project will be released publicly to accelerate scientific progress and support the responsible development of AI.

This project develops a unified framework for video perception, reasoning, and control, grounded in natural language supervision. The research is organized around three integrated thrusts. The first thrust develops structured, multi-level video representations by leveraging narrated instructional video. Actions are modeled as learnable transformations applied to object representations, enabling the system to recognize novel combinations of actions and objects not seen during training. A hierarchical extension captures complex activities at multiple tempo

Key facts

NSF award ID: 2541848
Awardee: University of North Carolina at Chapel Hill (NC)
SAM.gov UEI: D3LHU66KBLD5
PI: Gediminas Bertasius
Primary program: 01002627DB NSF RESEARCH & RELATED ACTIVIT
All programs: Artificial Intelligence (AI), CAREER-Faculty Erly Career Dev, ROBUST INTELLIGENCE
Estimated total: $584,937
Funds obligated: $343,839
Transaction type: Continuing Grant
Period: 08/01/2026 → 07/31/2031