This project aims to advance how machines interpret video content by developing new capabilities for analyzing extended video streams, ranging from several minutes to multiple hours, far beyond the short clips that most current systems are designed to handle. As video continues to dominate digital communication and information sharing, the ability to understand it over extended timescales is becoming increasingly essential. This research will support both live and recorded formats and encompass a broad spectrum of video sources, including footage from wearable, mobile, and fixed cameras. By equipping intelligent systems with the capacity to comprehend complex, time-varying visual information, the project is expected to drive progress in real-world applications such as interactive assistance, autonomous navigation, augmented reality, and content summarization.

The primary technical challenge addressed by this project is the extreme data volume inherent in long video sequences, which can produce millions of representational units, known as tokens, when processed by modern vision-language models based on transformer architectures. Token counts of this size exceed the context length limits of current models and hinder effective reasoning over long time horizons. To overcome these limitations, the project proposes a novel framework centered on token selection and context-aware representation. Instead of encoding entire video streams, the system will prioritize a small, highly