Softmax Attention
Related Articles from SNS
Specialization of softmax attention heads: insights from the high-dimensional single-location model
arXiv:2603.03993v2. Multi-head attention enables transformer models to represent multiple attention patterns simultaneously. Empirically, head specialization emerges in distinct stages during training, while many heads remain redundant and learn similar representations. We propose a theoretical model capturing this phenomenon, based on the multi-index and single-location regression frameworks.
Customizing the Inductive Biases of Softmax Attention using Structured Matrices
arXiv:2509.07963v2. The core component of attention is the scoring function, which transforms the inputs into low-dimensional queries and keys and takes the dot product of each pair. While the low-dimensional projection improves efficiency, it causes information loss for certain tasks that have intrinsically high-dimensional inputs. Additionally, attention uses the same scoring function for all input pairs, without imposing a distance-dependent compute bias...
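The scoring function described in this abstract can be sketched as follows. This is a minimal NumPy illustration of standard dot-product scoring, not the structured-matrix method the paper proposes. The names `attention_scores`, `W_q`, `W_k`, the toy sizes, and the 1/sqrt(d_head) scaling are assumptions made for the example.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row maximum for numerical stability.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention_scores(X, W_q, W_k):
    # Project inputs to low-dimensional queries and keys,
    # then take the dot product of every query/key pair.
    Q = X @ W_q                      # (n, d_head)
    K = X @ W_k                      # (n, d_head)
    d_head = Q.shape[-1]
    # The 1/sqrt(d_head) scaling is the common convention (assumed here).
    return softmax(Q @ K.T / np.sqrt(d_head), axis=-1)

# Hypothetical toy sizes: d_head is much smaller than d_model.
rng = np.random.default_rng(0)
n, d_model, d_head = 5, 16, 4
X = rng.normal(size=(n, d_model))
W_q = rng.normal(size=(d_model, d_head))
W_k = rng.normal(size=(d_model, d_head))
A = attention_scores(X, W_q, W_k)    # (n, n) attention weights
```

Because the pre-softmax score is x_i^T (W_q W_k^T) x_j, the bilinear form W_q W_k^T has rank at most d_head, which is one way to read the information loss the abstract mentions for high-dimensional inputs.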
Don't Read Everything: A Curvature-Conditioned Query for Linear Attention
Linear attention reduces the quadratic cost of softmax attention by maintaining a recurrent fast-weight state, but it consistently lags on in-context retrieval and long-context tasks. Existing remedies act on the write side of memory through gating, delta updates, or kernel feature maps, but the read step is left unchanged: every past key contributes additively to the output, so useful targets are diluted by the bulk of stored vectors. We borrow one specific...
Neural Attention Search Linear: Towards Adaptive Token-Level Hybrid Attention Models
arXiv:2602.03681v2. The quadratic computational complexity of softmax transformers has become a bottleneck in long-context scenarios. In contrast, linear attention model families provide a promising direction towards a more efficient sequential model. These linear attention models compress past KV values into a single hidden state, thereby efficiently reducing complexity during both training and inference.
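The idea of compressing past keys and values into a single hidden state can be sketched with the simplest unnormalized linear attention (identity feature map, no gating). This is a generic illustration under those assumptions, not the architecture of any paper listed here; the function names are hypothetical.

```python
import numpy as np

def linear_attention_recurrent(Q, K, V):
    # Keep one (d_k, d_v) state S instead of the full key/value history.
    n, d_k = K.shape
    d_v = V.shape[1]
    S = np.zeros((d_k, d_v))
    out = np.zeros((n, d_v))
    for t in range(n):
        S += np.outer(K[t], V[t])   # write: add k_t v_t^T to the state
        out[t] = Q[t] @ S           # read: every past key contributes additively
    return out

def linear_attention_parallel(Q, K, V):
    # Equivalent quadratic form: causal-masked (Q K^T) V without a softmax.
    n = Q.shape[0]
    mask = np.tril(np.ones((n, n)))
    return ((Q @ K.T) * mask) @ V

rng = np.random.default_rng(0)
Q = rng.normal(size=(6, 4))
K = rng.normal(size=(6, 4))
V = rng.normal(size=(6, 3))
recurrent_out = linear_attention_recurrent(Q, K, V)
parallel_out = linear_attention_parallel(Q, K, V)
```

The recurrent form uses memory independent of sequence length, while the parallel form makes the relationship to masked attention explicit; the two agree up to floating-point error.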
Component Ablation for Efficient Hybrid Language Model Architectures: Performance, Resilience, and Compression Implications
arXiv:2603.22473v2. Hybrid language models combine softmax attention with linear-time sequence mechanisms such as state-space or linear-attention layers, but the functional contribution of each component type remains insufficiently characterized. We study component-level ablation in two sub-1B hybrid language models, Qwen3.5-0.8B and Falcon-H1-0.5B, using likelihood-based evaluation, downstream benchmarks, layer-wise interventions, random controls, and...
A Drug-Target Specificity Foundation Model for Off-target Prediction, Repurposing, and Generative Design
Molecular recognition - which small molecule binds which protein, and with what selectivity - governs the efficacy, safety, and discovery of every therapeutic, yet binding specificity is still determined by experimental screening or by computational methods that first predict three-dimensional structure. Transformer softmax attention is mathematically isomorphic to the Boltzmann distribution governing molecular binding at thermal equilibrium, an identity that prescribes a single...
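For reference, the formal correspondence between softmax and the Boltzmann distribution can be written out as below. The symbols $E_i$ (energy of state $i$), $k_B$ (Boltzmann constant), $T$ (temperature), and $s_i$ (attention score) are notation assumed for this sketch; how the paper maps attention scores to binding energies is not stated in the excerpt.

```latex
p_i = \frac{\exp(-E_i / k_B T)}{\sum_j \exp(-E_j / k_B T)},
\qquad
\mathrm{softmax}(s)_i = \frac{\exp(s_i)}{\sum_j \exp(s_j)},
\qquad
s_i = -\frac{E_i}{k_B T}
\;\Longrightarrow\;
\mathrm{softmax}(s)_i = p_i .
```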
Normalization Equivariance for Arbitrary Backbones, with Application to Image Denoising
arXiv:2605.08193v3. Normalization Equivariance (NE) is a structural prior that improves robustness to distribution shift in image-to-image tasks. A function $f$ is normalization equivariant iff $f(a y + b\mathbf{1}) = a f(y) + b\mathbf{1}$ for all $a>0$ and $b\in\mathbb{R}$. Existing NE methods constrain every internal layer to NE-compatible operations. These constraints add runtime cost and exclude standard transformer components such as softmax attention and...
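The NE definition above can be checked numerically for simple functions. The sketch below is only an illustration of the definition, not the paper's method: `ne_gap`, `mean_filter`, and `relu` are hypothetical helpers, and the test vector and constants are arbitrary.

```python
import numpy as np

def ne_gap(f, y, a, b):
    # Largest deviation between f(a*y + b*1) and a*f(y) + b*1.
    lhs = f(a * y + b)
    rhs = a * f(y) + b
    return float(np.max(np.abs(lhs - rhs)))

def mean_filter(y):
    # Circular 3-tap average: linear, with weights that sum to one, so it is NE.
    return (np.roll(y, 1) + y + np.roll(y, -1)) / 3.0

def relu(y):
    # Scale-equivariant for a > 0 but not shift-equivariant, so it is not NE.
    return np.maximum(y, 0.0)

y = np.array([-1.0, 0.5, 2.0, -0.3])
a, b = 2.5, -1.0
gap_mean = ne_gap(mean_filter, y, a, b)   # numerically zero
gap_relu = ne_gap(relu, y, a, b)          # clearly nonzero
```

A single pair (a, b) can only refute NE, not prove it; the mean filter satisfies the identity for all a > 0 and b because it is linear and maps the constant vector to itself.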
MesaNet: Sequence Modeling by Locally Optimal Test-Time Training
arXiv:2506.05233v2. Sequence modeling is currently dominated by causal transformer architectures that use softmax self-attention. Although widely adopted, transformers require memory and compute that scale linearly with sequence length during inference. A recent stream of work linearized the softmax operation, resulting in powerful recurrent neural network (RNN) models, such as DeltaNet, Mamba or xLSTM, with constant memory and compute costs.
A Unifying View of Attention Sinks: Two Algorithms, Two Solutions
When attention concentrates on a single token, a sink, what is the model actually computing? Attention sinks are ubiquitous in softmax transformers, yet this shared visual signature can hide fundamentally different algorithms. We show that visually similar sink patterns can reflect two distinct mechanisms: (i) adaptive nop, where a head suppresses its update by routing to a null token, and (ii) broadcast, where a sink aggregates and redistributes global information.
Capacity-Controlled Global Attention for Graph Transformers
arXiv:2604.17324v2. Global self-attention drives modern graph transformers, yet the softmax at its core imposes a structural constraint rarely examined directly: every attention row is non-negative and sums to one, so each per-head output is a mass-conserving convex combination of value vectors. A node can never "attend to nothing." We argue this conservation constraint is a single root cause behind three pathologies usually studied in isolation: the collapse...
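The conservation constraint described in this abstract is easy to verify numerically. The sketch below uses random toy scores and values (all names and sizes are assumptions for illustration) and shows that lowering every score in a row does not reduce the total attention mass.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row maximum for numerical stability.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(0)
n, d = 6, 3
scores = rng.normal(size=(n, n))
V = rng.normal(size=(n, d))

A = softmax(scores)               # each row is non-negative and sums to one
A_low = softmax(scores - 50.0)    # uniformly low scores give the same weights
out = A @ V                       # each output row is a convex combination of rows of V

row_sums = A.sum(axis=1)
within_bounds = bool(
    np.all(out <= V.max(axis=0) + 1e-12) and np.all(out >= V.min(axis=0) - 1e-12)
)
```

Since softmax is invariant to adding a constant to a row, a node cannot express "nothing is relevant" by scoring every candidate low, and each output coordinate stays between the minimum and maximum of the corresponding value coordinates.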