Tensor Parallelism

Related Articles from SNS

Scaling Neural Network Verification with Tensor Parallelism and Fully Sharded Data Parallelism

arXiv:2606.09377v1 Announce Type: new Abstract: Formal neural network verification, proving that a network satisfies safety properties for all inputs in a specified domain, is bounded in practice by GPU memory: standard implementations of bound-propagation algorithms (IBP, CROWN, α-CROWN) require weight and relaxation-coefficient matrices to reside entirely on one accelerator. We adapt two parallelism techniques originally developed for large-scale model training to the...

arXiv CS 1d ago

Deterministic Inference across Tensor Parallel Sizes That Eliminates Training-Inference Mismatch

arXiv:2511.17826v2 Announce Type: replace Abstract: Deterministic inference is increasingly critical for large language model (LLM) applications such as LLM-as-a-judge evaluation, multi-agent systems, and Reinforcement Learning (RL). However, existing LLM serving frameworks exhibit non-deterministic behavior: identical inputs can yield different outputs when system configurations (e.g., tensor parallel (TP) size, batch size) vary, even under greedy decoding. This arises from the...

arXiv CS 9d ago

Parallelizing Large-Scale Tensor Network Contraction on Multiple GPUs

arXiv:2606.01852v1 Announce Type: new Abstract: Exact tensor network contraction underpins quantum circuit simulation, quantum error correction, combinatorial optimization, and many-body dynamics. The dominant parallelization strategy, slicing, scales exponentially and incurs redundant computation. We present a multi-GPU framework that instead distributes intermediate tensors across devices with explicit communication, converting a fixed contraction path into a communication-efficient...

arXiv CS 8d ago

Demystifying Pipeline Parallelism: First Theory for PipeDream

arXiv:2606.03498v1 Announce Type: new Abstract: Training modern machine learning models increasingly requires computation to be distributed across many accelerators. Data parallelism remains the default choice and is often paired with tensor-parallel sharding, but model parallelism becomes unavoidable once parameters, activations, or optimizer states no longer fit on a single device. This paper studies pipeline model parallelism through the lens of PipeDream (PD) (Harlap et al., 2018).

arXiv CS 7d ago

Scaling LLM Inference Beyond Amdahl's Limits via Eliminating Non-Scalable Overheads

arXiv:2606.01927v1 Announce Type: new Abstract: Deployers of online LLM services usually seek to maximize cluster-wide performance given a fixed number of GPUs. Tensor parallelism (TP) is necessary to fit modern models but scales sub-linearly as the TP degree t grows, due to cross-GPU communication and non-scalable runtime work, as predicted by Amdahl's Law. Conversely, increasing t improves memory efficiency and alleviates KV-cache contention and swapping.

arXiv CS 8d ago

Anatomy of a high-performance EP kernel

Large language models are large. Because they’re large, we need lots of GPUs to run them. It would be nice if LLM inference were ‘embarrassingly parallel’ and we could just always compute independent things on different GPUs.

Hacker News 5h ago

A space-time sparse-grid method for the wave equation

arXiv:2606.09688v1 Announce Type: new Abstract: We develop a fast space-time numerical scheme for approximating solutions to the linear wave equation. The approach is based on the sparse-grid combination technique applied to a coercive space-time discretization. Designed for tensor-product space-time discretizations, the method enables efficient parallelization of the resulting solver.

arXiv CS 1d ago

A 10 year old Xeon is all you need (for 26B-A4B MTP Drafters without GPU)

The previous post covered getting Gemma 4’s MTP drafters quantized and paired with a verifier. This one is about running the result on a machine that has no business running it. I have a recycled server.

Hacker News 9d ago

Nvidia Cosmos 3

Physical AI systems must understand the real world before they can act within it. Robots, autonomous vehicles, and smart spaces need to understand what’s happening in their world, predict what’s likely to happen next, and generate actions for specific environments, embodiments, and tasks. NVIDIA Cosmos 3 is a frontier foundation model for physical AI that combines physical reasoning, world generation, and action generation within a single open model.

Hacker News 9d ago

Deconstructing the Composite Channel for Beyond Diagonal RIS: Channel Estimation and Beamforming Design

arXiv:2606.01564v1 Announce Type: cross Abstract: As beyond-diagonal reconfigurable intelligent surfaces (BD-RISs) gain increasing attention in high-frequency wireless communications, accurate and scalable channel-estimation methods become essential. This paper develops a parametric channel-estimation and beamforming framework that deconstructs the composite BD-RIS channel into its generating directional factors, revealing the tensor structure induced jointly by propagation geometry and...

arXiv CS 8d ago