Synchronization-Aware Stage Accounting for Distributed ML Training

Related Articles from SNS

StageFrontier: Synchronization-Aware Stage Accounting for Distributed ML Training

Abstract: When a distributed training job slows down, the hard part is knowing where to look. Synchronization hides the cause: a stall on one rank shows up as a wait on the others, so a data delay on a single rank can surface as backward time across the group. The cheap dashboards that run all the time -- per-stage averages and maxima -- misread this, double-counting the same exposed delay or burying the slow rank in an average, while full profilers see it clearly but are far too heavy...

arXiv CS 2d ago
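
The misreading described in the abstract can be illustrated with a toy model. The sketch below is not StageFrontier's method; it assumes a hypothetical step with three stages (data, forward, backward), a single synchronization point at the end of backward, and made-up millisecond timings in which only rank 0 has a slow data load. The function name and all numbers are assumptions for illustration.

```python
# Toy model (hypothetical numbers): 4 ranks, one step, times in ms.
# Assumption: backward ends with a collective, so no rank leaves the
# backward stage until the slowest rank has arrived at it.

def observed_stage_times(data, forward, backward_compute):
    """Return per-rank stage times as a per-stage timer would record them."""
    # Time at which each rank reaches the synchronization point.
    arrive = [d + f + b for d, f, b in zip(data, forward, backward_compute)]
    release = max(arrive)  # everyone is released when the last rank arrives
    # Observed backward time = own compute + time spent waiting for others.
    backward = [release - (d + f) for d, f in zip(data, forward)]
    stages = {"data": list(data), "forward": list(forward), "backward": backward}
    return stages, release

# Rank 0 has a 50 ms data delay; all other work is identical across ranks.
stages, step_time = observed_stage_times(
    data=[60, 10, 10, 10],
    forward=[20, 20, 20, 20],
    backward_compute=[30, 30, 30, 30],
)

per_stage_max = {name: max(times) for name, times in stages.items()}
per_stage_avg = {name: sum(times) / len(times) for name, times in stages.items()}

print("observed backward per rank:", stages["backward"])  # [30, 80, 80, 80]
print("true step time:", step_time)                        # 110
print("sum of per-stage maxima:", sum(per_stage_max.values()))  # 160
print("per-stage averages:", per_stage_avg)
```

In this toy setup the 50 ms data delay on rank 0 appears once in the data maximum and again as waiting inside the backward maximum on the other ranks, so the maxima sum to 160 ms against a true step time of 110 ms. The averages sum correctly but attribute most of the step to backward (67.5 ms) and little to data (22.5 ms), which hides the rank that actually caused the delay.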