Tensor Parallelism
No mentions found
This entity hasn't been tracked yet, or Iris is still building its knowledge base.
Related Articles from SNS
Scaling Neural Network Verification with Tensor Parallelism and Fully Sharded Data Parallelism
arXiv:2606.09377v1 Announce Type: new Abstract: Formal neural network verification -- proving that a network satisfies safety properties for \emph{all} inputs in a specified domain -- is bounded in practice by GPU memory: standard implementations of bound-propagation algorithms (IBP, CROWN, $\alpha$-CROWN) require weight and relaxation-coefficient matrices to reside entirely on one accelerator. We adapt two parallelism techniques originally developed for large-scale model training to the...
Deterministic Inference across Tensor Parallel Sizes That Eliminates Training-Inference Mismatch
arXiv:2511.17826v2 Announce Type: replace Abstract: Deterministic inference is increasingly critical for large language model (LLM) applications such as LLM-as-a-judge evaluation, multi-agent systems, and Reinforcement Learning (RL). However, existing LLM serving frameworks exhibit non-deterministic behavior: identical inputs can yield different outputs when system configurations (e.g., tensor parallel (TP) size, batch size) vary, even under greedy decoding. This arises from the...
Parallelizing Large-Scale Tensor Network Contraction on Multiple GPUs
arXiv:2606.01852v1 Announce Type: new Abstract: Exact tensor network contraction underpins quantum circuit simulation, quantum error correction, combinatorial optimization, and many-body dynamics. The dominant parallelization strategy, slicing, scales exponentially and incurs redundant computation. We present a multi-GPU framework that instead distributes intermediate tensors across devices with explicit communication, converting a fixed contraction path into a communication-efficient...
Demystifying Pipeline Parallelism: First Theory for PipeDream
arXiv:2606.03498v1 Announce Type: new Abstract: Training modern machine learning models increasingly requires computation to be distributed across many accelerators. Data parallelism remains the default choice and is often paired with tensor-parallel sharding, but model parallelism becomes unavoidable once parameters, activations, or optimizer states no longer fit on a single device. This paper studies pipeline model parallelism through the lens of PipeDream (PD) (Harlap et al., 2018).
Scaling LLM Inference Beyond Amdahl`s Limits via Eliminating Non-Scalable Overheads
arXiv:2606.01927v1 Announce Type: new Abstract: Deployers of online LLM services usually seek to maximize cluster-wide performance given a fixed number of GPUs. Tensor parallelism (TP) is necessary to fit modern models but scales sub-linearly as the TP degree t grows, due to cross-GPU communication and non-scalable runtime work, as predicted by Amdahl's Law. Conversely, increasing t improves memory efficiency and alleviates KV-cache contention and swapping.
Anatomy of a high-performance EP kernel
Anatomy of a high-performance EP kernel Large language models are large. Because they’re large, we need lots of GPUs to run them. It would be nice if LLM inference were ‘embarrassingly parallel’ and we could just always compute independent things on different GPUs.
A space-time sparse-grid method for the wave equation
arXiv:2606.09688v1 Announce Type: new Abstract: We develop a fast space-time numerical scheme for approximating solutions to the linear wave equation. The approach is based on the sparse-grid combination technique applied to a coercive space-time discretization. Designed for tensor-product space-time discretizations, the method enables efficient parallelization of the resulting solver.
A 10 year old Xeon is all you need (for 26B-A4B MTP Drafters without GPU)
A 10 year old Xeon is all you need 17 minutes read The previous post covered getting Gemma 4’s MTP drafters quantized and paired with a verifier. This one is about running the result on a machine that has no business running it. I have a recycled server.
Nvidia Cosmos 3
Physical AI systems must understand the real world before they can act within it. Robots, autonomous vehicles, and smart spaces need to understand what’s happening in their world, predict what’s likely to happen next, and generate actions for specific environments, embodiments, and tasks. NVIDIA Cosmos 3 is a frontier foundation model for physical AI that combines physical reasoning, world generation, and action generation within a single open model.
Deconstructing the Composite Channel for Beyond Diagonal RIS: Channel Estimation and Beamforming Design
arXiv:2606.01564v1 Announce Type: cross Abstract: As beyond-diagonal reconfigurable intelligent surfaces (BD-RISs) gain increasing attention in high-frequency wireless communications, accurate and scalable channel-estimation methods become essential. This paper develops a parametric channel-estimation and beamforming framework that deconstructs the composite BD-RIS channel into its generating directional factors, revealing the tensor structure induced jointly by propagation geometry and...