NCCL
No mentions found
This entity hasn't been tracked yet, or Iris is still building its knowledge base.
Related Articles from SNS
Don't Let a Few Network Failures Slow the Entire AllReduce
arXiv:2606.01680v1 Announce Type: new Abstract: Network failures are among the most frequent hardware faults in large-scale GPU clusters and a leading cause of training-job interruptions. Modern collective communication libraries such as NCCL mitigate network failures by rerouting traffic through surviving NICs on the same server, trading reduced inter-node bandwidth for uninterrupted training. However, the degraded server remains on the critical path of the standard ring algorithm, slowing...
An Efficient, Reliable and Observable Collective Communication Library in Large-scale GPU Training Clusters
Announce Type: replace Abstract: Large-scale LLM training requires collective communication libraries to exchange data among distributed GPUs. As a company dedicated to building and operating large-scale GPU training clusters, we encounter several practical limitations of NCCL in production, including 1) SM competition between computation and communication, 2) expensive restart costs under link failures, and 3) insufficient observability of transient collective communication anomalies. To...
HetCCL: Enabling Collective Communication For Mixed-Vendor Heterogeneous Clusters
arXiv:2605.31000v1 Announce Type: new Abstract: Training Large Language Models (LLMs) on heterogeneous clusters presents significant challenges for collective communication, as hardware from multiple vendors introduces diverse network and computational characteristics. Existing collective communication frameworks (e.g., NCCL, RCCL) designed for homogeneous environments fail to address mixed-hardware setups, while communication libraries with heterogeneous support (e.g., Gloo, OpenMPI) incur...
StageFrontier: Synchronization-Aware Stage Accounting for Distributed ML Training
new Abstract: When a distributed training job slows down, the hard part is knowing where to look. Synchronization hides the cause: a stall on one rank shows up as a wait on the others, so a data delay on a single rank can surface as backward time across the group. The cheap dashboards that run all the time -- per-stage averages and maxima -- misread this, double-counting the same exposed delay or burying the slow rank in an average, while full profilers see it clearly but are far too heavy...