Stochastic Sparse Attention for Memory-Bound Inference

arXiv CS Thursday 04 June 2026, 04:00 UTC By Kyle Lee, Corentin Delacour, Kevin Callahan-Coray, Kyle Jiang, Can Yaras, Samet Oymak, Tathagata Srimani, Kerem Y. Camsari 1 min read

Key Points

arXiv:2605.01910v2 Announce Type: replace Abstract: Autoregressive decoding becomes bandwidth-limited at long contexts, as generating each token requires reading all $n_k$ key and value vectors from KV cache. We present Stochastic Additive No-mulT Attention (SANTA), a method that sparsifies value-cache access by sampling $S \ll n_k$ indices from the post-softmax distribution and aggregates only those value rows. This yields an unbiased estimator of the post-softmax value aggregation while replacing value-stage multiply-accumulates with gather-and-add. We introduce stratified and systematic sampling to design variance-reduced, GPU-friendly variants. Evaluated on Llama-3.1-8B-Instruct at 32k-token contexts, S$^2$ANTA matches baseline accuracy while achieving up to $1.5\times$ decode-step attention-kernel speedup over FlashInfer and FlashDecoding on an NVIDIA RTX 6000 Ada. In batched long-context generation, these kernel gains translate to up to $1.25\times$ end-to-end decode-latency speedup. Finally, we propose Bernoulli $qK^\mathsf{T}$ sampling as a complementary technique to sparsify the score stage, reducing key-feature access through stochastic ternary queries. Both methods are complementary to upstream quantization, low-rank projection, KV-cache compression, and KV-cache selection methods. Together, they point toward sparse, multiplier-free, and energy-efficient inference. We open-source our kernels at: https://github.com/OPUSLab/SANTA.git

Autoregressive (ORG) KV (ORG) SANTA (LOCATION) GPU (ORG) FlashInfer (ORG) FlashDecoding (ORG) NVIDIA (ORG) Bernoulli (PERSON)

Originally published by arXiv CS Read original →

Stochastic Sparse Attention for Memory-Bound Inference

Related Stories

Indonesia Landslides Devastated Endangered Orangutans, Study Finds

Neuroscientist reveals the one 'superfood' he eats every single day to slow down ageing

Wall Street Braces for SpaceX IPO Friday

Indonesia Landslides Devastated Endangered Orangutans, Study Finds