Home Knowledge Base KV Cache

KV Cache

No mentions found

This entity hasn't been tracked yet, or Iris is still building its knowledge base.

Related Articles from SNS

RedKnot: Efficient Long-Context LLM Serving with Head-Aware KV Reuse and SegPagedAttention

Announce Type: new Abstract: As the input length of large language model (LLM) serving continues to grow, the KV cache has become a dominant bottleneck in AI infrastructure. It limits GPU memory capacity, serving concurrency, cache reuse, and distributed scalability. Several important problems, including position-independent KV cache, prefix KV cache compression, hot/cold KV cache separation, and distributed KV cache management, all depend on how the KV cache is represented and managed.

arXiv CS 5d ago

STAR-KV: Low-Rank KV Cache Compression via Soft Thresholding for Adaptive Rank Control

arXiv:2606.08382v1 Announce Type: new Abstract: Low-rank projection has emerged as a promising approach for compressing the KV cache by exploiting hidden-dimension redundancy. However, prior methods rely on fixed or heuristic rank selection and struggle to achieve aggressive compression with minimal accuracy degradation. We propose STAR-KV, an adaptive low-rank KV cache compression framework with fine-grained rank control.

arXiv CS 1d ago

Multi-Segment Attention: Enabling Efficient KV-Cache Management for Faster Large Language Model Serving

Announce Type: new Abstract: Large Language Model (LLM) inference relies on key-value (KV) caches to avoid redundant attention computation. While approximate KV cache retention techniques reduce memory usage by sacrificing model accuracy, lossless approaches instead evict KV cache blocks from GPU memory and reconstruct them on demand to preserve exact outputs. Existing lossless KV cache management systems primarily base eviction decisions on access frequency or positional heuristics, without...

arXiv CS 7d ago

GRKV: Global Regression for Training-Free KV Cache Compression in Long-Context LLMs

Announce Type: new Abstract: Large language models (LLMs) with extended context lengths rely on the key-value (KV) cache to support attention over prior tokens. However, maintaining the KV cache incurs substantial memory overhead, motivating KV-cache compression methods that enforce a fixed budget through eviction and merging. Modern eviction methods increasingly adopt span-based retention because preserving contiguous spans is empirically effective and better preserves semantic coherence.

arXiv CS 9d ago

Tangram: Unlocking Non-Uniform KV Cache for Efficient Multi-turn LLM Serving

arXiv:2606.06302v1 Announce Type: new Abstract: Multi-turn Large Language Model (LLM) serving is critical for consistent user experiences, yet the linear growth of the Key-Value (KV) cache imposes significant pressure on GPU memory and bandwidth. Non-uniform KV compression effectively preserves more information by considering the individual importance of each KV cache. However, such KV cache heterogeneity introduces various systemic challenges - including memory fragmentation, scheduling...

arXiv CS 5d ago

Value-Aware Stochastic KV Cache Eviction for Reasoning Models

arXiv:2606.03928v1 Announce Type: new Abstract: Reasoning models improve accuracy through extended chains of thought, but their long outputs create a memory and compute bottleneck. KV cache eviction methods reduce this cost by evicting unimportant key-value pairs from the cache, yet they often yield worse accuracy than selection-based sparse attention alternatives, which keep the full KV cache. We identify key factors crucial to KV cache eviction accuracy.

arXiv CS 7d ago

KVarN: Variance-Normalized KV-Cache Quantization Mitigates Error Accumulation in Reasoning Tasks

Announce Type: new Abstract: Test-time scaling is a powerful approach to obtain better reasoning in large language models, but it becomes memory-bottlenecked during long-horizon decoding, as the KV-cache grows. KV-cache quantization can help improve this, but current methods are evaluated under prefill-like settings and errors behave differently under autoregressive decoding. We show that in the latter regime, quantization errors accumulate across timesteps, driven primarily by incorrect...

arXiv CS 7d ago

Probing the Prompt KV Cache: Where It Becomes Dispensable

Announce Type: new Abstract: Prior KV cache compression schemes empirically demonstrate that the prompt cache is partially redundant during decoding, dropping or summarising entries with little accuracy loss. We ask when and what kind of redundancy: at which layers, after how many decoding steps, and in what form can the prompt span KV cache be replaced without breaking the task.

arXiv CS 9d ago

KVarN: Native vLLM backend for KV-cache quantization by Huawei

⚡️ Built for agentic and long-context workloads. 💡 KVarN delivers 3-5x more KV-cache capacity and up to ~1.3x the throughput of FP16, so you fit far longer contexts and serve more concurrent requests, with FP16-level accuracy. 🔌 Calibration-free, plug-and-play with vLLM.

Hacker News 6d ago

ParisKV: Fast and Drift-Robust KV-Cache Retrieval for Long-Context LLMs

arXiv:2602.07721v3 Announce Type: replace Abstract: KV-cache retrieval is essential for long-context LLM inference, yet existing methods struggle with distribution drift and high latency at scale. We introduce ParisKV, a drift-robust, GPU-native KV-cache retrieval framework based on collision-based candidate selection, followed by a quantized inner-product reranking estimator. For million-token contexts, ParisKV supports CPU-offloaded KV caches via Unified Virtual Addressing (UVA), enabling...

arXiv CS 9d ago