# NeTS:Small: Efficient Collective Communication for Distributed ML in the Cloud

> **NSF 01002526DB NSF RESEARCH & RELATED ACTIVIT** · Cornell University (NY) · $525,000

## Abstract

Machine learning (ML) has transformed how we solve complex problems, from understanding languages to making accurate predictions in medicine and economics. However, modern ML models have grown so large, often involving trillions of parameters, that they can no longer run efficiently on a single computer. Instead, these enormous models must be distributed across many powerful processors, known as accelerators, in data centers. A critical challenge in running distributed ML models efficiently is managing the communication between accelerators. The process by which accelerators share information, called collective communication, becomes a bottleneck that slows down training and inference tasks. Current approaches to managing communication assume all connections between accelerators are equal. In reality, connections vary widely in speed and capacity, which creates inefficiencies.

This project aims to significantly improve collective communication by creating software tools and algorithms specifically designed for the diverse connections found in modern cloud-based accelerator systems. First, the project will measure how communication speeds and delays vary between accelerators, accounting for complexities like proprietary technologies and hidden network paths within data centers. Next, these measurements will be used to automatically generate optimized collective communication strategies tailored to specific cloud setups. This approach ensures that each deploy

## Key facts

- **NSF award ID:** 2435852
- **Awardee organization:** Cornell University (NY)
- **SAM.gov UEI:** G56PUALJ3KT5
- **PI:** Rachee Singh
- **Primary program:** 01002526DB NSF RESEARCH & RELATED ACTIVIT
- **All programs:** SMALL PROJECT
- **Estimated total:** $525,000
- **Funds obligated:** $525,000
- **Transaction type:** Standard Grant
- **Period:** 07/15/2025 → 06/30/2028

## Primary source

NSF Award Search: https://www.nsf.gov/awardsearch/showAward?AWD_ID=2435852

## Citation

> US National Science Foundation, Award 2435852, NeTS:Small: Efficient Collective Communication for Distributed ML in the Cloud. Retrieved via AI Analytics 2026-06-07 from https://api.ai-analytics.org/grant/nsf/2435852. Licensed CC0.

---

*[NSF Awards dataset](/datasets/nsf-awards) · CC0 1.0*
