Immediate Communication for Distributed AI Tasks

Abstract

The growing scale of large AI models necessitates efficient communication strategies across multi-GPU and multi-node infrastructures. Current methods that overlap communication with computation only at the inter-operator level break down when operators have data dependencies, leaving hardware underutilized. DistFuse addresses this by overlapping computation and communication at a fine granularity, triggering communication as soon as the data it depends on is ready rather than waiting for an entire operator to finish, thereby reducing latency. Initial experiments show up to a 44.3% reduction in communication latency for Llama3-70B inference on a single node, demonstrating its potential to accelerate diverse AI workloads.
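To make the idea of "triggering communication as soon as data is ready" concrete, the minimal sketch below illustrates the general fine-grained overlap pattern using plain PyTorch: a matrix multiply is split into row chunks and an asynchronous all-reduce is issued for each chunk as soon as that chunk's partial result exists, instead of after the full output. This is only an illustration of the underlying technique, not DistFuse's actual kernel-level mechanism; the function name, chunk count, and shapes are hypothetical, and it assumes `torch.distributed` has already been initialized with an NCCL backend and the tensors live on the local GPU.

```python
# Illustrative sketch of fine-grained compute/communication overlap
# (not DistFuse's implementation; names and chunking are assumptions).
import torch
import torch.distributed as dist


def chunked_matmul_allreduce(x: torch.Tensor, w: torch.Tensor, num_chunks: int = 4) -> torch.Tensor:
    """Compute x @ w in row chunks, starting the all-reduce for each chunk
    as soon as its partial result is ready instead of waiting for the whole
    operator to finish."""
    outputs, handles = [], []
    for x_chunk in x.chunk(num_chunks, dim=0):
        y = x_chunk @ w                              # partial result for this chunk
        handles.append(dist.all_reduce(y, async_op=True))  # communicate immediately
        outputs.append(y)
    for h in handles:                                # drain outstanding communication
        h.wait()
    return torch.cat(outputs, dim=0)
```

With coarse-grained overlap, the all-reduce could only begin after the entire product was computed; here, communication for early chunks proceeds while later chunks are still being computed, which is the latency-hiding effect the abstract describes.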

Publication
SOSP Workshop on Hot Topics in System Infrastructure
辛继灏
Ph.D. Student in Machine Learning Systems