CC* Integration-Large: Scaling Scientific Workloads on Distributed Commodity GPUs and Storage through Campus-level RDMA Networking

NSF Award Search · 01002526DB NSF RESEARCH & RELATED ACTIVIT · $750,000

Abstract

Scientific workloads have outgrown the capabilities of today's campus networks, driven by two key trends: the increasing adoption of machine learning (ML) in scientific research, and the growing need to access and process large-scale datasets. Remote Direct Memory Access (RDMA) has emerged as a key network technology for providing high-bandwidth, low-latency communication for distributed ML and fast data storage. This project explores an RDMA-based campus network design and implementation that enables both the shared use of distributed, heterogeneous Graphics Processing Units (GPUs) to accelerate scientific applications and fast access to research data storage. The project entails four research thrusts. First, high-bandwidth, low-latency RDMA network infrastructure will be established to connect campus GPUs using standard data center-class network hardware. Second, new workload scheduling systems and algorithms will be developed to make efficient use of the RDMA network. Third, storage disaggregation over RDMA will be enabled, allowing compute servers to access remote NVMe-class storage with minimal performance overhead. Finally, varied science applications, such as large language models (LLMs), domain-specific natural language processing (NLP), medical image processing, and cryo-electron microscopy (CryoEM), will be evaluated on top of the RDMA network, the workload scheduler, and the disaggregated storage. This project presents a first step toward improving the efficiency

Key facts

NSF award ID: 2503010
Awardee: Duke University (NC)
SAM.gov UEI: TP7EK8DZV6N5
PI: Danyang Zhuo
Primary program: 01002526DB NSF RESEARCH & RELATED ACTIVIT
All programs:
Estimated total: $750,000
Funds obligated: $750,000
Transaction type: Standard Grant
Period: 07/01/2025 → 06/30/2027