Home Knowledge Base AllReduce

AllReduce

No mentions found

This entity hasn't been tracked yet, or Iris is still building its knowledge base.

Related Articles from SNS

Don't Let a Few Network Failures Slow the Entire AllReduce

arXiv:2606.01680v1 Announce Type: new Abstract: Network failures are among the most frequent hardware faults in large-scale GPU clusters and a leading cause of training-job interruptions. Modern collective communication libraries such as NCCL mitigate network failures by rerouting traffic through surviving NICs on the same server, trading reduced inter-node bandwidth for uninterrupted training. However, the degraded server remains on the critical path of the standard ring algorithm, slowing...

arXiv CS 8d ago

HetCCL: Enabling Collective Communication For Mixed-Vendor Heterogeneous Clusters

arXiv:2605.31000v1 Announce Type: new Abstract: Training Large Language Models (LLMs) on heterogeneous clusters presents significant challenges for collective communication, as hardware from multiple vendors introduces diverse network and computational characteristics. Existing collective communication frameworks (e.g., NCCL, RCCL) designed for homogeneous environments fail to address mixed-hardware setups, while communication libraries with heterogeneous support (e.g., Gloo, OpenMPI) incur...

arXiv CS 9d ago