Home Knowledge Base full+sparse

full+sparse

No mentions found

This entity hasn't been tracked yet, or Iris is still building its knowledge base.

Related Articles from SNS

SSA: Sparse Sparse Attention by Aligning Full and Sparse Attention Outputs in Feature Space

Announce Type: replace Abstract: Sparse attention reduces the quadratic complexity of full self-attention but faces two challenges: (1) an attention gap, where applying sparse attention to full-attention-trained models causes performance degradation due to train-inference distribution mismatch, and (2) a capability gap, where models trained purely with sparse attention lack complete gradient flow, preventing them from matching full-attention performance. We propose SSA (Sparse Sparse...

arXiv CS 6d ago

SSA: Sparse Sparse Attention by Aligning Full and Sparse Attention Outputs in Feature Space

arXiv:2511.20102v4 Announce Type: replace Abstract: Sparse attention reduces the quadratic complexity of full self-attention but faces two challenges: (1) an attention gap, where applying sparse attention to full-attention-trained models causes performance degradation due to train-inference distribution mismatch, and (2) a capability gap, where models trained purely with sparse attention lack complete gradient flow, preventing them from matching full-attention performance. We propose SSA...

arXiv CS 5d ago

Full Attention Strikes Back: Transferring Full Attention into Sparse within Hundred Training Steps

arXiv:2605.16928v2 Announce Type: replace Abstract: Long-context inference in large language models is bottlenecked by the quadratic cost of full attention. Existing efficient alternatives often rely either on native sparse training or on heuristic token eviction, creating an undesirable trade-off among efficiency, training cost, and accuracy. In this work, we show that full-attention LLMs are already intrinsically sparse and can be transformed into highly sparse models with only minimal...

arXiv CS 1d ago

Value-Aware Stochastic KV Cache Eviction for Reasoning Models

arXiv:2606.03928v1 Announce Type: new Abstract: Reasoning models improve accuracy through extended chains of thought, but their long outputs create a memory and compute bottleneck. KV cache eviction methods reduce this cost by evicting unimportant key-value pairs from the cache, yet they often yield worse accuracy than selection-based sparse attention alternatives, which keep the full KV cache. We identify key factors crucial to KV cache eviction accuracy.

arXiv CS 7d ago

OmniMem: Scalable and Adaptive Memory Retrieval for Long Video Generation

Announce Type: new Abstract: Autoregressive (AR) video generation extends videos by producing latent chunks sequentially, but scaling to long videos requires repeated access to a growing historical KV cache. Existing methods reduce this cost by truncating the KV cache or compressing it into implicit memory, but both lose explicit access to query-relevant historical details. We propose OmniMem, an explicit full-range memory retrieval framework that performs sparse KV retrieval over the...

arXiv CS 9d ago

Perplexity Can Miss SAE Feature Damage Under Quantization

Announce Type: replace Abstract: Quantization is a standard path to deploying large language models, and quantized models are typically judged acceptable when perplexity or downstream accuracy remains close to the full-precision original. But behavioral parity need not imply feature fidelity: the sparse-autoencoder (SAE) features used to interpret a full-precision model may change after weight rounding.

arXiv CS 2d ago

Sparse Memory Finetuning as a Low-Forgetting Alternative to LoRA and Full Finetuning

Announce Type: replace Abstract: Adapting a pretrained language model to a new task often hurts the general capabilities it already had, a problem known as catastrophic forgetting. Sparse Memory Finetuning (SMF) tries to avoid this by adding key-value memory layers to the model and, on each training step, updating only the small set of memory rows that the current batch reads most heavily. We re-implement SMF on Qwen-2.5-0.5B-Instruct and compare it with LoRA and full finetuning on MedMCQA,...

arXiv CS 1d ago

Suboptimality bounds for trace-bounded SDPs enable a faster and scalable low-rank SDP solver SDPLR+

arXiv:2406.10407v3 Announce Type: replace-cross Abstract: Semidefinite programs (SDPs) and their solvers are powerful tools with many applications in machine learning and data science. Designing scalable SDP solvers is challenging because by standard the positive semidefinite decision variable is an $n \times n$ dense matrix, even though the input is often an $n \times n$ sparse matrix. However, the solution may not require a full-rank matrix, as shown by Barvinok and Pataki.

arXiv CS 7d ago

SAGE: Scalable AI Governance & Evaluation

Announce Type: replace Abstract: Evaluating relevance in large-scale search systems is fundamentally constrained by the governance gap between nuanced, resource-constrained human oversight and the high-throughput requirements of production systems. While traditional approaches rely on engagement proxies or sparse manual review, these methods often fail to capture the full scope of high-impact relevance failures. We present \textbf{SAGE} (Scalable AI Governance \& Evaluation), a framework...

arXiv CS 5d ago

SparseX: Efficient Segment-Level KV Cache Sharing for Interleaved LLM Serving

arXiv:2606.01751v1 Announce Type: new Abstract: In long-context LLM serving, the prefill stage often dominates time-to-first-token and computational cost. Although Prefix Cache in vLLM/PagedAttention has been widely used to reuse identical prompt prefixes, repeated content in practical applications frequently appears as non-prefix, cross-request, cross-turn, and cross-agent segments, which makes conventional cache mechanisms insufficient. This paper presents SparseX, a segment-level KV Cache...

arXiv CS 8d ago