RI: Small: Empowering Longer Video Understanding via Token Compression, Selection, and Reasoning

NSF Award Search · 01002526DB NSF RESEARCH & RELATED ACTIVIT · $600,000

Abstract

This project aims to advance how machines interpret video content by developing new capabilities for analyzing extended video streams, ranging from several minutes to multiple hours, far beyond the short clips that most current systems are designed to handle. As videos continue to dominate digital communication and information sharing, the ability to understand video over extended timescales is becoming increasingly essential. This research will support both live and recorded formats and encompass a broad spectrum of video sources, including footage from wearable, mobile, and fixed cameras. By equipping intelligent systems with the capacity to comprehend complex, time-varying visual information, the project is expected to drive progress in real-world applications such as interactive assistance, autonomous navigation, augmented reality, and content summarization.

The primary technical challenge addressed by this project is the extreme data volume inherent in long video sequences, which can produce millions of representational units (known as tokens) when processed by modern vision-language models based on transformer architectures. This volume exceeds the context length limits of current models and hinders effective reasoning over long time horizons. To overcome these limitations, the project proposes a novel framework centered on token selection and context-aware representation. Instead of encoding entire video streams, the system will prioritize a small, highly

Key facts

NSF award ID: 2519216
Awardee: University of Illinois at Urbana-Champaign (IL)
SAM.gov UEI: Y8CWNJRCNN91
PI: Yuxiong Wang
Primary program: 01002526DB NSF RESEARCH & RELATED ACTIVIT
All programs: SMALL PROJECT, ROBUST INTELLIGENCE
Estimated total: $600,000
Funds obligated: $600,000
Transaction type: Standard Grant
Period: 09/01/2025 → 08/31/2028