While natural intelligence often learns sophisticated skills through simple visual observation, current artificial intelligence (AI) systems largely lack this ability. Most modern AI systems require massive amounts of text or human-labeled data to understand the world, a dependency that limits their ability to perform complex physical tasks that are difficult to describe in words. To address these limitations, this project creates a scientific framework that allows machines to learn directly from passive observation of, and active interaction with, the physical environment. The project moves toward a paradigm in which video serves as the primary medium for machine intelligence, enabling autonomous systems to plan and act by observing videos of human and robotic behavior. By fostering the development of more capable and helpful autonomous agents, the project serves the national interest in AI leadership and technical workforce development. Education and outreach efforts include specialized course design, cross-disciplinary collaborations, and mentorship programs that support students from high school through the doctoral level.

This research establishes a new paradigm for machine intelligence centered on the concept of adaptable video blueprints: representations that allow an agent to translate visual experience into sequences of physical actions that generalize across diverse tasks, environments, and embodiments. Three integrated thrusts drive the technical approach. The first thrust develops visual planners that use video generation models to causally predict the future states needed to reach a target goal. The second thrust develops inverse dynamics models that map these predicted visual sequences into specific motor commands. The third thrust implements an automatic self-improvement loop that allows the agent to refine its planning and execution through continuous experience and adaptation. This award advances the fields of computer vision and robotics.
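As a rough illustration of how the three thrusts could fit together, the sketch below shows one possible plan-then-act loop in Python. All class, function, and variable names here (VideoPlanner, InverseDynamicsModel, plan_and_act, and the placeholder models inside them) are hypothetical stand-ins chosen for this sketch, not the project's actual components: a video generation model would propose future frames toward a goal, an inverse dynamics model would convert consecutive predicted frames into motor commands, and executed transitions would be stored so that both models can later be fine-tuned, closing the self-improvement loop.

    # Hypothetical sketch of a video-blueprint plan-then-act loop.
    # All names and interfaces are illustrative assumptions, not the project's code.
    import numpy as np

    class VideoPlanner:
        """Thrust 1 stand-in: proposes future frames from the current observation to a goal image."""
        def plan(self, obs, goal, horizon=4):
            # A real planner would roll out a video generation model; here we
            # linearly interpolate between observation and goal as a placeholder.
            return [obs + (goal - obs) * (t + 1) / horizon for t in range(horizon)]

    class InverseDynamicsModel:
        """Thrust 2 stand-in: maps a pair of consecutive frames to a motor command."""
        def infer_action(self, frame_t, frame_t1):
            # Placeholder: treat the mean pixel change as a one-dimensional "action".
            return float((frame_t1 - frame_t).mean())

    def plan_and_act(env_step, obs, goal, planner, idm, replay_buffer):
        """Generate a visual plan, execute it, and log experience for self-improvement (Thrust 3)."""
        frames = [obs] + planner.plan(obs, goal)
        for frame_t, frame_t1 in zip(frames[:-1], frames[1:]):
            action = idm.infer_action(frame_t, frame_t1)
            next_obs = env_step(action)  # interact with the environment
            replay_buffer.append((frame_t, action, next_obs))
        # In the full loop, replay_buffer would be used to fine-tune planner and idm.
        return replay_buffer

    if __name__ == "__main__":
        obs, goal = np.zeros((8, 8)), np.ones((8, 8))
        buffer = plan_and_act(lambda a: obs + a, obs, goal,
                              VideoPlanner(), InverseDynamicsModel(), [])
        print(f"collected {len(buffer)} transitions")

In this toy version the planner and inverse dynamics model are trivial numerical placeholders; the point of the sketch is only the control flow of planning in visual space, translating predicted frames into actions, and accumulating experience for later refinement.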