# CAREER: Learning to See, Reason, and Act: Language-Grounded Video Intelligence for Perceptual AI Agents

> **NSF 01002627DB NSF RESEARCH & RELATED ACTIVIT** · University of North Carolina at Chapel Hill (NC) · $584,937

## Abstract

Modern artificial intelligence (AI) systems can recognize simple actions in short video clips, yet they cannot understand the full complexity of events unfolding over time. They struggle to follow multi-step activities, explain why something happened, imagine what would happen under different circumstances, or use what they observe to guide physical actions. These limitations prevent AI from having a transformative impact in areas of national importance such as robotics, healthcare, and manufacturing, where understanding dynamic visual scenes is essential. This project will develop a new generation of AI systems that can watch videos of the real world, understand what is happening at multiple levels of detail, reason about causes and consequences, and translate that understanding into purposeful action. For example, a robot watching a person assemble furniture could learn the sequence of steps involved, reason about why a particular step failed, and adapt its own plan accordingly. The educational activities integrated into this project will develop new university courses, mentor undergraduate and high school students, and engage the broader public through outreach programs connecting AI to accessible topics such as sports and skill learning. All software, trained models, and datasets produced by this project will be released publicly to accelerate scientific progress and support the responsible development of AI.

This project develops a unified framework for video perception, reasoning, and control, grounded in natural language supervision. The research is organized around three integrated thrusts. The first thrust develops structured, multi-level video representations by leveraging narrated instructional video. Actions are modeled as learnable transformations applied to object representations, enabling the system to recognize novel combinations of actions and objects not seen during training. A hierarchical extension captures complex activities at multiple tempo
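The compositional idea in the first thrust, actions as transformations applied to object representations, can be illustrated with a toy sketch. Everything here is an assumption for illustration: the abstract does not say what form the transformations take, so this sketch uses small hand-set linear maps where the project would presumably learn them. The names `apply_action`, `recognize`, `ACTIONS`, and `OBJECTS` are hypothetical, as are the two-dimensional embeddings.

```python
from typing import Dict, List, Tuple

Vector = List[float]
Matrix = List[List[float]]

# Hypothetical object embeddings (hand-set, not learned).
OBJECTS: Dict[str, Vector] = {
    "apple": [1.0, 0.0],
    "bread": [0.0, 1.0],
}

# Hypothetical actions, each assumed to be a linear map on object embeddings.
ACTIONS: Dict[str, Matrix] = {
    "cut": [[0.5, 0.0], [0.0, 0.5]],
    "flip": [[-1.0, 0.0], [0.0, -1.0]],
}


def apply_action(action: Matrix, obj: Vector) -> Vector:
    """Transform an object embedding with an action matrix."""
    return [sum(w * x for w, x in zip(row, obj)) for row in action]


def squared_distance(a: Vector, b: Vector) -> float:
    """Squared Euclidean distance between two embeddings."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def recognize(
    observed: Vector,
    actions: Dict[str, Matrix],
    objects: Dict[str, Vector],
) -> Tuple[str, str]:
    """Return the (action, object) pair whose transformed embedding
    lies closest to the observed embedding."""
    candidates = (
        (squared_distance(apply_action(mat, vec), observed), a_name, o_name)
        for a_name, mat in actions.items()
        for o_name, vec in objects.items()
    )
    _, best_action, best_object = min(candidates)
    return best_action, best_object


# Every action can be paired with every object, including pairs that
# were never observed together.
print(recognize([0.0, 0.45], ACTIONS, OBJECTS))
```

Because each action is stored once and applied to any object embedding, a pair such as ("cut", "bread") can be scored even if only ("cut", "apple") and ("flip", "bread") had been observed. This is the sense in which such a factorization supports novel action-object combinations.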

## Key facts

- **NSF award ID:** 2541848
- **Awardee organization:** University of North Carolina at Chapel Hill (NC)
- **SAM.gov UEI:** D3LHU66KBLD5
- **PI:** Gediminas Bertasius
- **Primary program:** 01002627DB NSF RESEARCH & RELATED ACTIVIT
- **All programs:** Artificial Intelligence (AI), CAREER-Faculty Erly Career Dev, ROBUST INTELLIGENCE
- **Estimated total:** $584,937
- **Funds obligated:** $343,839
- **Transaction type:** Continuing Grant
- **Period:** 08/01/2026 → 07/31/2031

## Primary source

NSF Award Search: https://www.nsf.gov/awardsearch/showAward?AWD_ID=2541848

## Citation

> US National Science Foundation, Award 2541848, CAREER: Learning to See, Reason, and Act: Language-Grounded Video Intelligence for Perceptual AI Agents. Retrieved via AI Analytics 2026-07-05 from https://api.ai-analytics.org/grant/nsf/2541848. Licensed CC0.

---

*[NSF Awards dataset](/datasets/nsf-awards) · CC0 1.0*