An Efficient, Reliable and Observable Collective Communication Library in Large-scale GPU Training Clusters

arXiv CS, Tuesday 02 June 2026, 04:00 UTC. By Mingjun Zhang (Infrawaves), Xiaohe Hu (Infrawaves), Menghao Zhang (Beihang University), Ziteng Chen (Infrawaves), Yanmin Jia (Infrawaves), Yan Zhang (Infrawaves), Da Liu (Infrawaves), Qing Chen (Infrawaves), Fangzheng Jiao (Beihang University), Jun Chen (Infrawaves), He Liu (Infrawaves), Aohan Zeng (Tsinghua University), Shuaixing Duan (Zhipu AI), Ruya Gu (Infrawaves), Yang Jing (Infrawaves), Bowen Han (China Unicom Research Institute), Wei Chen (Infrawaves), Wenqi Xie (Infrawaves), Jinlong Hou (Shanghai Innovation Institute), Yuan Cheng (Shanghai Innovation Institute), Hongzhou Zhang (Shanghai AI Power Technology Co., Ltd), Bohua Xu (China Unicom Research Institute), Mingwei Xu (Tsinghua University), Chunming Hu (Beihang University)

arXiv:2510.00991v2. Abstract: Large-scale LLM training requires collective communication libraries to exchange data among distributed GPUs. As a company dedicated to building and operating large-scale GPU training clusters, we encounter several practical limitations of NCCL in production, including 1) SM competition between computation and communication, 2) expensive restart costs under link failures, and 3) insufficient observability of transient collective communication anomalies. To address these challenges, we propose VCCL, an efficient, reliable, and observable collective communication library for large-scale GPU training clusters. VCCL removes SM-consuming P2P kernels by moving intra-node data movement and stream dependency enforcement to CPU threads and GPU copy engines. VCCL also introduces a primary-backup QP mechanism to tolerate frequent NIC port failures, and uses a window-based monitor to observe network anomalies at the O(μs) level. We open-source VCCL and have deployed it in production training clusters for several months. Compared with NCCL, VCCL improves training throughput by up to 5.28% and cuts substantial GPU resource waste through runtime fault tolerance and fine-grained monitoring. We also share the experience and lessons learned while deploying VCCL in large-scale clusters.
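
The abstract does not say how the window-based monitor works internally. As a rough illustration of the general idea only, here is a minimal Python sketch of a sliding-window bandwidth check. The class name `WindowMonitor`, its parameters, and the threshold rule are all hypothetical and are not taken from VCCL.

```python
from collections import deque


class WindowMonitor:
    """Hypothetical sliding-window bandwidth monitor (illustrative, not VCCL code)."""

    def __init__(self, window_us, min_bytes_per_us):
        self.window_us = window_us                # window length in microseconds
        self.min_bytes_per_us = min_bytes_per_us  # assumed anomaly threshold
        self.events = deque()                     # (timestamp_us, nbytes) pairs
        self.total_bytes = 0

    def record(self, timestamp_us, nbytes):
        """Record a completed transfer and evict events older than the window."""
        self.events.append((timestamp_us, nbytes))
        self.total_bytes += nbytes
        cutoff = timestamp_us - self.window_us
        while self.events and self.events[0][0] <= cutoff:
            _, old_bytes = self.events.popleft()
            self.total_bytes -= old_bytes

    def bandwidth(self):
        """Average bytes per microsecond over the most recent window."""
        return self.total_bytes / self.window_us

    def is_anomalous(self):
        """True when windowed bandwidth drops below the threshold."""
        return self.bandwidth() < self.min_bytes_per_us


monitor = WindowMonitor(window_us=100, min_bytes_per_us=50)
for t in range(0, 100, 10):          # steady traffic: 1000 bytes every 10 us
    monitor.record(t, 1000)
steady = monitor.is_anomalous()

monitor.record(300, 1000)            # a long stall, then a single transfer
stalled = monitor.is_anomalous()
```

In this sketch a short stall shows up as soon as the next completion is recorded, because older completions fall out of the window. The estimate is low until the first window has filled, so a real monitor would likely ignore readings during warm-up.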

Originally published by arXiv CS.
