VCCL
Related Articles from SNS
An Efficient, Reliable and Observable Collective Communication Library in Large-scale GPU Training Clusters
Abstract: Large-scale LLM training requires collective communication libraries to exchange data among distributed GPUs. As a company dedicated to building and operating large-scale GPU training clusters, we encounter several practical limitations of NCCL in production, including 1) SM competition between computation and communication, 2) expensive restart costs under link failures, and 3) insufficient observability of transient collective communication anomalies. To...