Parallelizing Large-Scale Tensor Network Contraction on Multiple GPUs

arXiv CS | Tuesday 02 June 2026, 04:00 UTC | By Feng Pan, Hanfeng Gu, Paul Springer, Xipeng Li

Key Points

arXiv:2606.01852v1. Abstract: Exact tensor network contraction underpins quantum circuit simulation, quantum error correction, combinatorial optimization, and many-body dynamics. The dominant parallelization strategy, slicing, scales exponentially and incurs redundant computation. We present a multi-GPU framework that instead distributes intermediate tensors across devices with explicit communication, converting a fixed contraction path into a communication-efficient schedule via GEMM-oriented mode reordering and communication-aware mode distribution planning. Within a single DGX H100 node (8 GPUs, NVLink), distribution delivers 7× to 173× extra speedup beyond embarrassingly parallel slicing, capturing nearly all of the available compute reduction (87-101%) because NVLink's high bandwidth keeps communication small relative to compute. Scaling the same four workloads to 1024 H100 GPUs over InfiniBand, the extra speedup beyond slicing ranges from 42× to 67,869×, demonstrating that communication-aware distributed contraction far surpasses slicing-based scaling limits for frontier tensor networks.
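To make the abstract's two central ideas concrete, the following is a minimal NumPy sketch, not the authors' framework. It assumes a hypothetical four-tensor ring network (A, B, C, D) and a hypothetical pair of tensors (X, Y); all names and sizes are illustrative. The first part shows slicing: fixing one index, contracting each slice independently, and summing, which recomputes the product C @ D in every slice. The second part shows what mode reordering toward a GEMM can look like for a single pairwise contraction.

```python
import numpy as np

rng = np.random.default_rng(0)

# --- Slicing (illustrative ring network, not from the paper) ---
# Network: A[i,j] B[j,k] C[k,l] D[l,i], contracted to a scalar.
d = 6
A, B, C, D = (rng.standard_normal((d, d)) for _ in range(4))

# Reference: contract everything at once.
full = np.einsum("ij,jk,kl,li->", A, B, C, D)

def contract_slice(j):
    """Contract the network with index j fixed to a single value."""
    AB_j = np.outer(A[:, j], B[j, :])   # shape (i, k), depends on the slice
    CD = C @ D                          # shape (k, i), identical in every slice:
                                        # this is the redundant work slicing incurs
    return np.einsum("ik,ki->", AB_j, CD)

# Each slice is independent, so slices can run on separate devices
# with no communication; the results are summed at the end.
sliced = sum(contract_slice(j) for j in range(d))

# --- Mode reordering so a pairwise contraction becomes one GEMM ---
# Contract X[a,b,c] with Y[c,d,b] over the shared modes b and c.
X = rng.standard_normal((2, 3, 4))      # modes (a, b, c)
Y = rng.standard_normal((4, 5, 3))      # modes (c, d, b)

# Reorder Y's modes to (b, c, d) so the contracted modes are leading
# and appear in the same order as the trailing modes of X.
Y_bcd = Y.transpose(2, 0, 1)            # shape (3, 4, 5)

# Flatten to matrices: X -> (a, b*c), Y -> (b*c, d), then a single matmul.
Z_gemm = X.reshape(2, 12) @ Y_bcd.reshape(12, 5)

# Reference result via einsum.
Z_ref = np.einsum("abc,cdb->ad", X, Y)
```

In this toy case the sliced sum reproduces the full contraction, and the reshaped matrix product reproduces the einsum result. The distributed approach described in the abstract avoids the per-slice recomputation by sharing intermediate tensors across devices, at the price of explicit communication.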


Originally published by arXiv CS.
