The physical world is compositional. A scene is composed of objects arranged according to physical laws, and each object consists of distinct parts that determine its functionality and affordances. For example, gravity dictates that chairs rest on the floor, and functionality dictates that a chair's base or legs be stable enough to support a person. Because a scene is organized by physical laws and functional constraints, this structure makes the scene simpler to understand. This project aims to develop a computer vision framework that learns and understands the physical world in a compositional manner, offering two significant benefits. First, a compositional interpretation of objects and scenes enables intelligent systems to engage in richer physical interactions and accomplish more complex tasks. Second, by decomposing complex entities into simpler constituents and modeling their relationships, this compositional approach addresses fundamental challenges faced by purely data-driven methods, including data inefficiency, the curse of dimensionality, and limited explainability. The outcomes of this project will impact a wide range of emerging applications, including robots that support manufacturing or assist with daily tasks, autonomous vehicles that enhance mobility and safety, and virtual or augmented reality interfaces that facilitate assistive workflows and remote collaboration. This project will ti