NeTS:Small: Efficient Collective Communication for Distributed ML in the Cloud

NSF Award Search · 01002526DB NSF RESEARCH & RELATED ACTIVIT · $525,000

Abstract

Machine learning (ML) has transformed how we solve complex problems, from understanding languages to making accurate predictions in medicine and economics. However, modern ML models have grown so large, often involving trillions of parameters, that they can no longer run efficiently on a single computer. Instead, these enormous models must be distributed across many powerful processors, known as accelerators, in data centers. A critical challenge in running distributed ML models efficiently is managing the communication between accelerators. The process by which accelerators share information, called collective communication, becomes a bottleneck that slows down training and inference tasks. Current approaches to managing communication assume all connections between accelerators are equal. In reality, connections can vary widely in speed and capacity, creating inefficiencies. This project aims to significantly improve collective communication by creating software tools and algorithms specifically designed for the diverse connections found in modern cloud-based accelerator systems. First, the project will measure how communication speeds and delays vary between accelerators, accounting for complexities like proprietary technologies and hidden network paths within data centers. Next, these measurements will be used to automatically generate optimized collective communication strategies tailored to specific cloud setups. This approach ensures that each deploy

Key facts

NSF award ID: 2435852
Awardee: Cornell University (NY)
SAM.gov UEI: G56PUALJ3KT5
PI: Rachee Singh
Primary program: 01002526DB NSF RESEARCH & RELATED ACTIVIT
All programs: SMALL PROJECT
Estimated total: $525,000
Funds obligated: $525,000
Transaction type: Standard Grant
Period: 07/15/2025 → 06/30/2028