Home Knowledge Base KV Access

KV Access

No mentions found

This entity hasn't been tracked yet, or Iris is still building its knowledge base.

Related Articles from SNS

OmniMem: Scalable and Adaptive Memory Retrieval for Long Video Generation

Announce Type: new Abstract: Autoregressive (AR) video generation extends videos by producing latent chunks sequentially, but scaling to long videos requires repeated access to a growing historical KV cache. Existing methods reduce this cost by truncating the KV cache or compressing it into implicit memory, but both lose explicit access to query-relevant historical details. We propose OmniMem, an explicit full-range memory retrieval framework that performs sparse KV retrieval over the...

arXiv CS 9d ago

ART: Attention Run-time Termination for Efficient Large Language Model Decoding

arXiv:2606.00024v2 Announce Type: replace Abstract: Long-context decoding in Large Language Models (LLMs) is constrained by the cost of accessing and processing the Key-Value (KV) cache. Despite the evidence that attention outputs depend jointly on keys and values, most existing KV management methods rely on key-only pruning, as incorporating values incurs prohibitive additional overhead. In this paper, we propose Attention Run-time Termination (ART), a lightweight run-time mechanism that...

arXiv CS 1d ago

MAGE: All-[MASK] Block Already Knows Where to Look in Block Diffusion LLM

arXiv:2602.14209v2 Announce Type: replace Abstract: Block diffusion LLMs are an emerging paradigm for parallel language generation, but their KV caching makes memory access the dominant bottleneck in long-context inference. Sparse attention, which attends only to a small KV subset per query, can reduce this latency with minimal accuracy loss. In block diffusion, however, the B tokens of each block must share a single KV subset, and we show this per-block constraint degrades existing sparse...

arXiv CS 2d ago

Multi-Segment Attention: Enabling Efficient KV-Cache Management for Faster Large Language Model Serving

Announce Type: new Abstract: Large Language Model (LLM) inference relies on key-value (KV) caches to avoid redundant attention computation. While approximate KV cache retention techniques reduce memory usage by sacrificing model accuracy, lossless approaches instead evict KV cache blocks from GPU memory and reconstruct them on demand to preserve exact outputs. Existing lossless KV cache management systems primarily base eviction decisions on access frequency or positional heuristics, without...

arXiv CS 7d ago

FlexNPU: Transparent NPU Virtualization for Dynamic LLM Prefill-Decode Co-location

Announce Type: new Abstract: Modern AI serving increasingly relies on NPUs for conventional inference and large language model serving. However, current NPU deployments commonly expose physical devices directly to applications, which limits runtime control over scheduling and makes it difficult to adapt execution to phase-level workload behavior. This limitation is particularly evident in LLM serving, where the prefill phase is compute-intensive while the decode phase is often constrained by...

arXiv CS 6d ago

FlexNPU: Transparent NPU Virtualization for Dynamic LLM Prefill-Decode Co-location

arXiv:2606.04415v2 Announce Type: replace Abstract: Modern AI serving increasingly relies on NPUs for conventional inference and large language model serving. However, current NPU deployments commonly expose physical devices directly to applications, which limits runtime control over scheduling and makes it difficult to adapt execution to phase-level workload behavior. This limitation is particularly evident in LLM serving, where the prefill phase is compute-intensive while the decode phase...

arXiv CS 1d ago

Stochastic Sparse Attention for Memory-Bound Inference

arXiv:2605.01910v2 Announce Type: replace Abstract: Autoregressive decoding becomes bandwidth-limited at long contexts, as generating each token requires reading all $n_k$ key and value vectors from KV cache. We present Stochastic Additive No-mulT Attention (SANTA), a method that sparsifies value-cache access by sampling $S \ll n_k$ indices from the post-softmax distribution and aggregates only those value rows. This yields an unbiased estimator of the post-softmax value aggregation while...

arXiv CS 6d ago

KV-Control: Parameter-Efficient K/V Injection for Trajectory-Controlled Text-to-Motion

arXiv:2606.05624v1 Announce Type: new Abstract: Text-conditioned 3D human motion models now synthesize plausible motions from prompts, but practical animation and embodied-agent workflows rarely stop at text: a character may need to follow a sketched root path, hit an end-effector target, or satisfy a multi-joint trajectory while still preserving the gait, style, and intent described by language. This exposes a control trade-off. A trajectory controller should be precise without overwriting...

arXiv CS 5d ago

GradMem: Learning to Write Context into Memory with Test-Time Gradient Descent

arXiv:2603.13875v2 Announce Type: replace Abstract: Many large language model applications require conditioning on long contexts. Transformers typically support this by storing a large per-layer KV-cache of past activations, which incurs substantial memory overhead. A desirable alternative is compressive memory: read a context once, store it in a compact state, and answer many queries from that state.

arXiv CS 9d ago

In situ nanocrystal confinement for efficient blue perovskite LEDs

Abstract Metal halide perovskites have emerged as promising semiconductors for light-emitting diodes (LEDs) owing to their excellent luminescence properties1. However, their performance remains limited, primarily owing to the inherent contradiction between ‘high crystallinity’ and ‘small size’ in the in situ synthesis of perovskite nanocrystals on substrates. Here we report efficient blue perovskite LEDs (PeLEDs) achieved via in situ polymerization-driven nanocrystal confinement to...

Nature 22h ago