Pipeline Parallelism
Related Articles from SNS
Learned Subspace Compression for Communication-Efficient Pipeline Parallelism
Abstract: Pipeline parallelism enables training of large language models that exceed single-device memory, yet inter-stage activation communication becomes the dominant bottleneck when trained on low-bandwidth networks. Recent work in this area has proposed using fixed orthogonal projections to compress activations. However, this still results in a significant performance degradation and requires a number of non-standard adaptations to constrain the optimization.
Breaking the Bubble: Asynchronous Pipeline Parallel Training with Bounded Weight Inconsistency
arXiv:2606.07881v1. Abstract: Pipeline parallelism is essential for training large neural networks, but existing schedules trade off throughput, memory, and optimization consistency. Synchronous pipelines preserve forward/backward weight consistency but suffer from bubbles; asynchronous pipelines remove bubbles but introduce weight-version mismatch, typically requiring weight stashing, prediction, or correction mechanisms. We introduce PACI (Pipeline Asynchronous training...
Speculative Pipeline Decoding: Higher-Accuracy and Zero-Bubble Speculation via Pipeline Parallelism
arXiv:2605.30852v1. Abstract: Speculative Decoding (SD) accelerates low-concurrency LLM inference by employing a draft-then-verify paradigm. However, mainstream methods typically rely on multi-token prediction, which introduces escalating prediction difficulty and serial drafting latency.
Demystifying Pipeline Parallelism: First Theory for PipeDream
arXiv:2606.03498v1. Abstract: Training modern machine learning models increasingly requires computation to be distributed across many accelerators. Data parallelism remains the default choice and is often paired with tensor-parallel sharding, but model parallelism becomes unavoidable once parameters, activations, or optimizer states no longer fit on a single device. This paper studies pipeline model parallelism through the lens of PipeDream (PD) (Harlap et al., 2018).
A Low-Cost, High-Throughput Design-Build-Test Pipeline for Engineering Genetic Systems: Stress Testing with Complex Structural Proteins
Genetic systems engineering is constrained by high DNA synthesis costs, assembly inefficiencies, and challenges in expressing complex proteins. To address these limitations, we developed a highly parallel, low-cost pipeline for the design, assembly, and functional screening of genetic systems, which we stress-tested on highly repetitive structural proteins, including spider silk, biocements, reflectins, and talins. The integrated pipeline combines computational genetic systems design,...
MURMUR: An Efficient Inference System for Long-Form ASR
arXiv:2606.01483v1. Abstract: Long-form automatic speech recognition (ASR) requires both high accuracy and low latency, but existing systems force a trade-off between the two. Chunk-based pipelines process audio in parallel windows for low latency, but lose cross-chunk context and need brittle heuristics to align speakers and timestamps at boundaries. Long-context ASR models resolve everything in a single pass for better accuracy, but are an order of magnitude slower.
Anatomy of a high-performance EP kernel
Anatomy of a high-performance EP kernel Large language models are large. Because they’re large, we need lots of GPUs to run them. It would be nice if LLM inference were ‘embarrassingly parallel’ and we could just always compute independent things on different GPUs.
Beyond Code Pairs: Dialogue-Based Data Generation for LLM Code Translation
Abstract: Large language models (LLMs) have shown remarkable capabilities in code translation, yet their performance deteriorates in low-resource programming domains such as Fortran and emerging frameworks like CUDA, where high-quality parallel data are scarce. We present an automated dataset generation pipeline featuring a dual-LLM Questioner-Solver design that incorporates external knowledge from compilers and runtime feedback. Beyond traditional source-target code...
SET: Stream-Event-Triggered Scheduling for Efficient CUDA Graph Pipelines
arXiv:2606.05495v1. Abstract: Achieving peak GPU performance remains a significant challenge as the system throughput is constrained by host-device synchronization delays and kernel scheduling overheads, even with aggressive kernel optimizations and batch processing. Furthermore, existing approaches often underutilize hardware resources such as compute cores and copy engines due to scheduling overheads. To address these problems, we propose a CUDA runtime framework for...
NVIDIA Isaac Sim: Enabling Scalable, GPU-Accelerated Simulation for Robotics
Abstract: Simulation has become a core infrastructure for robotics research. Unlike previous simulators, NVIDIA Isaac Sim leverages GPU acceleration to enable large-scale parallel training and physics-accurate modeling. Its synthetic data generation pipeline alleviates the scarcity of high-quality training data, supporting data-driven robot learning and large-scale simulation-centric experimentation.