OAC Core: AI4MPI: ML-Based Optimization for MPI Library

NSF Award Search · 01002526DB NSF RESEARCH & RELATED ACTIVIT · $584,257

Abstract

The Message Passing Interface (MPI) standard has been the de facto communication approach for scaling large scientific problems on large High-Performance Computing (HPC) clusters. Recently, MPI libraries have also been used to scale Deep Learning (DL) and Machine Learning (ML) applications on HPC clusters. The performance and scaling of an MPI library for HPC and AI applications depend heavily on the optimization of the underlying point-to-point protocols and collective communication algorithms. These optimizations, in turn, depend heavily on the characteristics of the underlying cluster architecture, including CPU, GPU, memory, and interconnects. MPI library developers currently carry out such optimizations in an offline, static, and manual manner, which makes the task very time-consuming. This project develops a novel AI4MPI approach in which AI techniques are developed and used to optimize point-to-point protocols and collective algorithms for current and next-generation HPC clusters with diverse characteristics. The proposed approach will enable MPI library developers to optimize their MPI libraries with significantly reduced effort for a range of clusters with varying characteristics, and to deliver higher performance for a range of HPC and AI applications. The project will provide valuable guidelines for designing and deploying next-generation HPC and AI systems, benefiting users in academia and industry. The res

Key facts

NSF award ID: 2504944
Awardee: OHIO STATE UNIVERSITY, THE (OH)
SAM.gov UEI: DLWBSLWAJWR1
PI: Dhabaleswar K Panda
Primary program: 01002526DB NSF RESEARCH & RELATED ACTIVIT
All programs: Artificial Intelligence (AI)
Estimated total: $584,257
Funds obligated: $584,257
Transaction type: Standard Grant
Period: 07/01/2025 → 06/30/2028