Machine learning (ML) has transformed how we solve complex problems, from understanding language to making accurate predictions in medicine and economics. However, modern ML models have grown so large, often involving trillions of parameters, that they can no longer run efficiently on a single computer. Instead, these enormous models must be distributed across many powerful processors, known as accelerators, in data centers. A critical challenge in running distributed ML models efficiently is managing the communication between accelerators. When accelerators exchange information, this process, called collective communication, becomes a bottleneck that slows down training and inference. Current approaches to managing communication assume that all connections between accelerators are equal. In reality, connections vary widely in speed and capacity, and ignoring these differences creates inefficiencies. This project aims to significantly improve collective communication by creating software tools and algorithms specifically designed for the diverse connections found in modern cloud-based accelerator systems. First, the project will measure how communication speeds and delays vary between accelerators, accounting for complexities like proprietary technologies and hidden network paths within data centers. Next, these measurements will be used to automatically generate optimized collective communication strategies tailored to specific cloud setups. This approach ensures that each deploy