Pipeline Parallelism

Related Articles from SNS

Learned Subspace Compression for Communication-Efficient Pipeline Parallelism

Abstract: Pipeline parallelism enables training of large language models that exceed single-device memory, yet inter-stage activation communication becomes the dominant bottleneck when training over low-bandwidth networks. Recent work in this area has proposed using fixed orthogonal projections to compress activations. However, this still results in significant performance degradation and requires a number of non-standard adaptations to constrain the optimization.

arXiv CS 5d ago

Breaking the Bubble: Asynchronous Pipeline Parallel Training with Bounded Weight Inconsistency

arXiv:2606.07881v1. Abstract: Pipeline parallelism is essential for training large neural networks, but existing schedules trade off throughput, memory, and optimization consistency. Synchronous pipelines preserve forward/backward weight consistency but suffer from bubbles; asynchronous pipelines remove bubbles but introduce weight-version mismatch, typically requiring weight stashing, prediction, or correction mechanisms. We introduce PACI (Pipeline Asynchronous training...

arXiv CS 1d ago

Speculative Pipeline Decoding: Higher-Accuracy and Zero-Bubble Speculation via Pipeline Parallelism

arXiv:2605.30852v1. Abstract: Speculative Decoding (SD) accelerates low-concurrency LLM inference by employing a draft-then-verify paradigm. However, mainstream methods typically rely on multi-token prediction, which introduces escalating prediction difficulty and serial drafting latency.

arXiv CS 9d ago

Demystifying Pipeline Parallelism: First Theory for PipeDream

arXiv:2606.03498v1. Abstract: Training modern machine learning models increasingly requires computation to be distributed across many accelerators. Data parallelism remains the default choice and is often paired with tensor-parallel sharding, but model parallelism becomes unavoidable once parameters, activations, or optimizer states no longer fit on a single device. This paper studies pipeline model parallelism through the lens of PipeDream (PD) (Harlap et al., 2018).

arXiv CS 7d ago

A Low-Cost, High-Throughput Design-Build-Test Pipeline for Engineering Genetic Systems: Stress Testing with Complex Structural Proteins

Genetic systems engineering is constrained by high DNA synthesis costs, assembly inefficiencies, and challenges in expressing complex proteins. To address these limitations, we developed a highly parallel, low-cost pipeline for the design, assembly, and functional screening of genetic systems, which we stress-tested on highly repetitive structural proteins, including spider silk, biocements, reflectins, and talins. The integrated pipeline combines computational genetic systems design,...

bioRxiv 1d ago

MURMUR: An Efficient Inference System for Long-Form ASR

arXiv:2606.01483v1. Abstract: Long-form automatic speech recognition (ASR) requires both high accuracy and low latency, but existing systems force a trade-off between the two. Chunk-based pipelines process audio in parallel windows for low latency, but lose cross-chunk context and need brittle heuristics to align speakers and timestamps at boundaries. Long-context ASR models resolve everything in a single pass for better accuracy, but are an order of magnitude slower.

arXiv CS 8d ago

Anatomy of a high-performance EP kernel

Large language models are large. Because they’re large, we need lots of GPUs to run them. It would be nice if LLM inference were ‘embarrassingly parallel’ and we could just always compute independent things on different GPUs.

Hacker News 4h ago

Beyond Code Pairs: Dialogue-Based Data Generation for LLM Code Translation

Abstract: Large language models (LLMs) have shown remarkable capabilities in code translation, yet their performance deteriorates in low-resource programming domains such as Fortran and emerging frameworks like CUDA, where high-quality parallel data are scarce. We present an automated dataset generation pipeline featuring a dual-LLM Questioner-Solver design that incorporates external knowledge from compilers and runtime feedback. Beyond traditional source-target code...

arXiv CS 5d ago

SET: Stream-Event-Triggered Scheduling for Efficient CUDA Graph Pipelines

arXiv:2606.05495v1. Abstract: Achieving peak GPU performance remains a significant challenge as the system throughput is constrained by host-device synchronization delays and kernel scheduling overheads, even with aggressive kernel optimizations and batch processing. Furthermore, existing approaches often underutilize hardware resources such as compute cores and copy engines due to scheduling overheads. To address these problems, we propose a CUDA runtime framework for...

arXiv CS 5d ago

NVIDIA Isaac Sim: Enabling Scalable, GPU-Accelerated Simulation for Robotics

Abstract: Simulation has become a core infrastructure for robotics research. Unlike previous simulators, NVIDIA Isaac Sim leverages GPU acceleration to enable large-scale parallel training and physics-accurate modeling. Its synthetic data generation pipeline alleviates the scarcity of high-quality training data, supporting data-driven robot learning and large-scale simulation-centric experimentation.

arXiv CS 7d ago