K/V
Related Articles from SNS
Do Transformers Need Three Projections? Systematic Study of QKV Variants
arXiv:2606.04032v2 Announce Type: replace Abstract: Transformers have become the standard solution for various AI tasks, with the query, key, and value (QKV) attention formulation playing a central role. However, the individual contribution of these three projections and the impact of omitting some remain poorly understood. We systematically evaluate three projection sharing constraints: a) Q-K=V (shared key-value), b) Q=K-V (shared query-key), and c) Q=K=V (single projection).
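The three sharing constraints named in the abstract can be illustrated with a minimal single-head attention sketch in plain Python. It assumes that "sharing" means reusing the same weight matrix for the tied projections; the function names, the `variant` labels, and the single-head, unmasked setup are illustrative choices, not the paper's implementation.

```python
import math
import random


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def softmax(row):
    """Numerically stable softmax over one row of scores."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def random_matrix(rows, cols, rng):
    return [[rng.gauss(0.0, 1.0) for _ in range(cols)] for _ in range(rows)]


def make_projections(variant, d_model, rng):
    """Return (Wq, Wk, Wv), tying weights according to the variant (labels are hypothetical)."""
    w1 = random_matrix(d_model, d_model, rng)
    w2 = random_matrix(d_model, d_model, rng)
    w3 = random_matrix(d_model, d_model, rng)
    if variant == "full":       # three independent projections
        return w1, w2, w3
    if variant == "shared_kv":  # (a) key and value share one projection
        return w1, w2, w2
    if variant == "shared_qk":  # (b) query and key share one projection
        return w1, w1, w2
    if variant == "single":     # (c) one projection for all three
        return w1, w1, w1
    raise ValueError(f"unknown variant: {variant}")


def attention(x, wq, wk, wv):
    """Single-head scaled dot-product attention; returns (output, pre-softmax scores)."""
    q, k, v = matmul(x, wq), matmul(x, wk), matmul(x, wv)
    scale = math.sqrt(len(q[0]))
    scores = [[sum(a * b for a, b in zip(qi, kj)) / scale for kj in k] for qi in q]
    weights = [softmax(row) for row in scores]
    return matmul(weights, v), scores
```

One visible consequence of tying the query and key projections is that the pre-softmax score matrix becomes symmetric, since each entry is a dot product of two rows of the same projected matrix.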
Revenue Guarantees of No-Swap-Regret Dynamics in First Price Auctions
arXiv:2606.06085v1 Announce Type: new Abstract: We study the revenue of approximate correlated equilibrium in discrete first price auctions - the set of allowable bids is $\mathcal{B} = \{0, 1/k, \dots, 1 - 1/k, 1\}$ for some $k \in \mathbb{N}$. We show that the revenue of any $\epsilon$-approximate correlated equilibrium is at least $v_2 - \Theta(1/k)- \Theta(\epsilon k^2)$, where $v_2 \geq 0$ is the second-highest valuation. Our results establish the first polynomial convergence rates on...
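As a small illustration of the setting, the discrete bid set $\mathcal{B}$ and the outcome of a single first price auction can be written out as below. The helper names and the tie-breaking rule (lowest bidder index wins) are assumptions made for this sketch; the abstract does not state a tie-breaking rule, and the sketch says nothing about the equilibrium revenue bound itself.

```python
from fractions import Fraction


def bid_grid(k):
    """Allowable bids {0, 1/k, ..., 1 - 1/k, 1} as exact fractions."""
    return [Fraction(i, k) for i in range(k + 1)]


def first_price_outcome(bids, k):
    """Return (winner_index, payment) for one round.

    The highest bidder wins and pays their own bid, so the seller's
    revenue equals the winning bid. Ties go to the lowest index
    (an assumption of this sketch).
    """
    grid = set(bid_grid(k))
    if any(b not in grid for b in bids):
        raise ValueError("every bid must lie on the grid")
    winner = max(range(len(bids)), key=lambda i: bids[i])
    return winner, bids[winner]
```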
A Temporal Spatial Minimax Rate for Smoothly-Varying Distributions in Wasserstein Space
arXiv:2606.07325v1 Announce Type: cross Abstract: We study the minimax rate of estimating a future value $\mu_{t_n+h}$ of a curve $t\mapsto\mu_t$ in the $2$-Wasserstein space $\mathcal{P}_2(\mathbb{R}^d)$ from finitely many noisy snapshots of its past, under an adiabatic bound $\|\nabla_t^k v\|\le\varepsilon$ on the $k$-th covariant derivative of the velocity field. Our central result is a unified temporal-spatial minimax lower bound: over regular, locally transport-rich subclasses, every...
Do Transformers Need Three Projections? Systematic Study of QKV Variants
Computer Science > Machine Learning [Submitted on 1 Jun 2026] Abstract: Transformers have become the standard solution for various AI tasks, with the query, key, and value (QKV) attention formulation playing a central role.
A Kronecker algorithm for locally closed sets over a perfect field
arXiv:2512.14888v2 Announce Type: replace-cross Abstract: We develop a probabilistic algorithm of Kronecker type for computing a Kronecker representation of a zero-dimensional linear section of an algebraic variety $V$ defined over a perfect field $k$. The variety $V$ is the Zariski closure of the set of common zeros $\{F_1=0,\ldots,F_r=0,G\not=0\}$ of multivariate polynomials $F_1,\ldots,F_r\in k[X_1,\ldots,X_n]$ outside a prescribed hypersurface $\{G=0\}$. We assume that $F_1,\ldots,F_r$...
Almost balanced ordered biclique covering of graphs
arXiv:2606.08506v1 Announce Type: cross Abstract: Let $f(n,k)$ be the minimum size of a collection of bicliques such that (i) every edge of the complete graph $K_n$ is covered by at least one and at most $k$ bicliques in the collection, and (ii) for each edge $\{u,v\}$, the number of bicliques in which $u$ appears in the first class and $v$ in the second class differs by at most one from the number of bicliques in which $u$ appears in the second class and $v$ in the first class. For $k=1$,...
Decomposing tournaments into comparability graphs
Announce Type: cross Abstract: In this note, we introduce the \emph{partial order decomposition number} of a digraph $D$, denoted $pod(D)$, defined as the minimum integer $k$ such that $A(D)=A(P_1)\cup\cdots\cup A(P_k)$, where $P_1,\ldots,P_k$ are partial orders on $V(D)$. We prove that $\dic(D)\le \diomega(D)^{pod(D)}$ for every digraph $D$. In particular, every class of digraphs with bounded $pod$ is polynomially $\dic$-bounded. We apply this to tournaments, showing that if $\mathcal C$ is...
BOKBO (Best of K Bad Options): Calibrated Abstention for VLA Policies
arXiv:2605.30660v1 Announce Type: new Abstract: Test-time scaling methods for vision-language-action (VLA) policies, such as RoboMonkey, SEAL, MG-Select, and V-GPS, sample K candidate action chunks at inference and execute the verifier-best one. When all K candidates are unsafe, the system executes a violating action with no warning. We propose BOKBO, the first conformal abstention layer for K-sample VLA inference, providing finite-sample distribution-free guarantees on executed-violation rate.
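The difference between plain best-of-K selection and selection with an abstention option can be sketched as follows. This is only the control flow: the threshold `tau` is a placeholder input, and how BOKBO calibrates it to obtain its conformal guarantee is not described in this excerpt. The function names and the "higher score is better" convention are assumptions.

```python
def best_of_k(candidates, verifier_scores):
    """Plain best-of-K: always return the candidate with the highest verifier score."""
    best = max(range(len(candidates)), key=lambda i: verifier_scores[i])
    return candidates[best]


def best_of_k_or_abstain(candidates, verifier_scores, tau):
    """Return the verifier-best candidate, or None (abstain) if even
    the best score falls below the threshold tau.

    tau is assumed to come from a separate calibration step that is
    not modeled here.
    """
    best = max(range(len(candidates)), key=lambda i: verifier_scores[i])
    if verifier_scores[best] < tau:
        return None  # all K options look bad: refuse to act
    return candidates[best]
```

With scores `[0.1, 0.2, 0.15]` and `tau = 0.5`, the first function still returns the second candidate, while the second function returns `None`.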
CART: Context-Anchored Recurrent Transformer -- A Parameter-Efficient Architecture with Learned Stability
arXiv:2606.01495v2 Announce Type: replace Abstract: We present CART (Context-Anchored Recurrent Transformer), a parameter-efficient language model that reuses a single shared core block R times across depth. Unlike prior looped transformers that recompute key-value tensors at every iteration, CART computes K and V once from a multi-layer prelude and has the recurrent core cross-attend to those frozen tensors via multi-head latent attention. A learned Linear Time-Invariant (LTI) gate keeps...
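The idea of computing K and V once and letting a shared block attend to them repeatedly can be sketched in plain Python. This is a simplified single-head illustration under stated assumptions: `context` stands in for the output of the prelude, the residual update is an illustrative choice, and multi-head latent attention and the LTI gate are omitted. None of the names below come from the paper.

```python
import math


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def softmax(row):
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attend(h, k, v, wq):
    """Queries come from the current state h; keys and values are fixed inputs."""
    q = matmul(h, wq)
    scale = math.sqrt(len(q[0]))
    scores = [[sum(a * b for a, b in zip(qi, kj)) / scale for kj in k] for qi in q]
    weights = [softmax(row) for row in scores]
    return matmul(weights, v)


def run_shared_core(context, h, wk, wv, wq, repeats):
    """Apply one shared cross-attention block `repeats` times.

    K and V are projected from `context` once, before the loop, and
    reused unchanged in every iteration.
    """
    k = matmul(context, wk)  # computed once
    v = matmul(context, wv)  # computed once
    for _ in range(repeats):
        update = cross_attend(h, k, v, wq)
        # Residual update of the recurrent state (an assumption of this sketch).
        h = [[a + b for a, b in zip(h_row, u_row)] for h_row, u_row in zip(h, update)]
    return h
```

A looped design that recomputed K and V inside the loop would perform `repeats` key and value projections instead of one each.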