Artificial intelligence systems can now generate realistic video, but they still struggle to learn the world knowledge needed to understand how environments change over time, anticipate the consequences of actions, and support decision-making in the physical world. This limitation is a major barrier to building machines that can safely and effectively assist people in homes, workplaces, and scientific settings. By developing learning methods that extract action-relevant structure directly from raw video and other sensor data, this project will help lay the foundation for more capable and adaptable intelligent systems, with potential benefits for robotics, scientific discovery, and other applications that require reliable machine perception. The project will also create open educational materials and mentorship activities that train students across vision, robotics, and machine learning.

This project develops a self-supervised framework for video representation learning that separates efficient perception modules from generative world models, enabling the discovery of compact representations of scene state, motion, and action from raw sensory streams without dense human annotation. The research will study learning objectives and architectures that support long-context prediction, planning, and action-conditioned world modeling, while also yielding representations that can implicitly support conventional vision capabilities such as 3D reconstruction, motion estimation, and