NCCL

Related Articles from SNS

Don't Let a Few Network Failures Slow the Entire AllReduce

arXiv:2606.01680v1. Abstract: Network failures are among the most frequent hardware faults in large-scale GPU clusters and a leading cause of training-job interruptions. Modern collective communication libraries such as NCCL mitigate network failures by rerouting traffic through surviving NICs on the same server, trading reduced inter-node bandwidth for uninterrupted training. However, the degraded server remains on the critical path of the standard ring algorithm, slowing...

arXiv CS 8d ago
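
The abstract above observes that a degraded server stays on the critical path of the ring algorithm. A minimal sketch of why, assuming a bandwidth-only cost model (latency and intra-node transfers ignored) that is not taken from the paper; the function name, server count, and bandwidth figures are hypothetical:

```python
def ring_allreduce_time(message_bytes, server_bandwidths):
    """Bandwidth-only estimate of ring AllReduce time in seconds.

    server_bandwidths: per-server inter-node bandwidth in bytes/second.
    Assumption: latency and intra-node transfers are ignored.
    """
    n = len(server_bandwidths)
    if n < 2:
        return 0.0  # nothing to exchange with a single server
    chunk = message_bytes / n   # each step moves one chunk per server
    steps = 2 * (n - 1)         # reduce-scatter phase plus all-gather phase
    # Every step completes only when the slowest link has finished.
    return steps * chunk / min(server_bandwidths)


# Hypothetical cluster: 16 servers, one of which lost 1/8 of its bandwidth.
healthy = [8.0e9] * 16
degraded = [8.0e9] * 15 + [7.0e9]

t_healthy = ring_allreduce_time(1.0e9, healthy)
t_degraded = ring_allreduce_time(1.0e9, degraded)
slowdown = t_degraded / t_healthy
print(f"slowdown: {slowdown:.3f}x")
```

Under this model a single server running at 7/8 of its bandwidth slows the whole collective by about 1.14x, because every ring step is gated by the minimum bandwidth.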

An Efficient, Reliable and Observable Collective Communication Library in Large-scale GPU Training Clusters

Abstract: Large-scale LLM training requires collective communication libraries to exchange data among distributed GPUs. As a company dedicated to building and operating large-scale GPU training clusters, we encounter several practical limitations of NCCL in production, including 1) SM competition between computation and communication, 2) expensive restart costs under link failures, and 3) insufficient observability of transient collective communication anomalies. To...

arXiv CS 8d ago

HetCCL: Enabling Collective Communication For Mixed-Vendor Heterogeneous Clusters

arXiv:2605.31000v1. Abstract: Training Large Language Models (LLMs) on heterogeneous clusters presents significant challenges for collective communication, as hardware from multiple vendors introduces diverse network and computational characteristics. Existing collective communication frameworks (e.g., NCCL, RCCL) designed for homogeneous environments fail to address mixed-hardware setups, while communication libraries with heterogeneous support (e.g., Gloo, OpenMPI) incur...

arXiv CS 9d ago

StageFrontier: Synchronization-Aware Stage Accounting for Distributed ML Training

Abstract: When a distributed training job slows down, the hard part is knowing where to look. Synchronization hides the cause: a stall on one rank shows up as a wait on the others, so a data delay on a single rank can surface as backward time across the group. The cheap dashboards that run all the time -- per-stage averages and maxima -- misread this, double-counting the same exposed delay or burying the slow rank in an average, while full profilers see it clearly but are far too heavy...

arXiv CS 2d ago
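
The misreading described in the abstract above can be reproduced with a toy model. This is an illustrative sketch, not the paper's accounting method: it assumes each rank loads data, then runs a backward pass that blocks until the slowest rank arrives. The function name and timings are hypothetical.

```python
def stage_report(data_load, backward_compute):
    """Toy per-stage dashboard for one synchronized training step.

    data_load: per-rank data loading time in seconds.
    backward_compute: pure backward compute time, identical on all ranks.
    Assumption: ranks synchronize inside backward, so waiting for the
    slowest rank is recorded as backward time on the faster ranks.
    """
    slowest = max(data_load)
    observed_backward = [backward_compute + (slowest - d) for d in data_load]
    n = len(data_load)
    return {
        "step_time": slowest + backward_compute,   # true wall-clock step
        "observed_backward": observed_backward,
        "avg_data": sum(data_load) / n,
        "avg_backward": sum(observed_backward) / n,
        "sum_of_maxima": max(data_load) + max(observed_backward),
    }


# Rank 3 has a 4-second data delay; the other ranks are healthy.
report = stage_report([1.0, 1.0, 1.0, 5.0], 1.0)
for key, value in report.items():
    print(key, value)
```

With these numbers the true step takes 6 seconds. The per-stage maxima sum to 10 seconds because the same 4-second delay is counted once as data time on rank 3 and again as backward time on the other ranks. The averages (2 seconds for data, 4 for backward) point at the backward stage and hide the slow rank.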