K/V

No mentions found

This entity hasn't been tracked yet, or Iris is still building its knowledge base.

Related Articles from SNS

Do Transformers Need Three Projections? Systematic Study of QKV Variants

Announce Type: new Abstract: Transformers have become the standard solution for various AI tasks, with the query, key, and value (QKV) attention formulation playing a central role. However, the individual contribution of these three projections and the impact of omitting some remain poorly understood. We systematically evaluate three projection sharing constraints: a) Q-K=V (shared key-value), b) Q=K-V (shared query-key), and c) Q=K=V (single projection).

arXiv CS 6d ago
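
The three sharing constraints named in the abstract above can be illustrated with a small single-head attention sketch. This is not the paper's code. It assumes that "Q-K=V" means keys and values share one projection matrix, "Q=K-V" means queries and keys share one, and "Q=K=V" means a single matrix serves all three. The names `attention`, `make_projections`, and the mode strings are hypothetical, chosen only for this illustration.

```python
import numpy as np


def softmax(x, axis=-1):
    # Subtract the row maximum for numerical stability.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def attention(x, w_q, w_k, w_v):
    # Single-head scaled dot-product attention over a (seq_len, d_model) input.
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    scores = q @ k.T / np.sqrt(q.shape[-1])
    return softmax(scores) @ v


def make_projections(d_model, mode, rng):
    # Returns (w_q, w_k, w_v); shared projections are the same array object.
    def new():
        return rng.standard_normal((d_model, d_model)) / np.sqrt(d_model)

    if mode == "full":        # three independent projections
        return new(), new(), new()
    if mode == "shared_kv":   # assumed reading of "Q-K=V"
        w_q, w_kv = new(), new()
        return w_q, w_kv, w_kv
    if mode == "shared_qk":   # assumed reading of "Q=K-V"
        w_qk, w_v = new(), new()
        return w_qk, w_qk, w_v
    if mode == "single":      # "Q=K=V"
        w = new()
        return w, w, w
    raise ValueError(f"unknown mode: {mode}")
```

Under these assumptions, the "shared_qk" variant produces a symmetric score matrix before the softmax, since the query and key matrices coincide, and each sharing constraint removes one or two of the three projection matrices from the parameter count.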

Do Transformers Need Three Projections? Systematic Study of QKV Variants

arXiv:2606.04032v2 Announce Type: replace Abstract: Transformers have become the standard solution for various AI tasks, with the query, key, and value (QKV) attention formulation playing a central role. However, the individual contribution of these three projections and the impact of omitting some remain poorly understood. We systematically evaluate three projection sharing constraints: a) Q-K=V (shared key-value), b) Q=K-V (shared query-key), and c) Q=K=V (single projection).

arXiv CS 5d ago

Revenue Guarantees of No-Swap-Regret Dynamics in First Price Auctions

arXiv:2606.06085v1 Announce Type: new Abstract: We study the revenue of approximate correlated equilibrium in discrete first price auctions - the set of allowable bids is $\mathcal{B} = \{0, 1/k, \dots, 1 - 1/k, 1\}$ for some $k \in \mathbb{N}$. We show that the revenue of any $\epsilon$-approximate correlated equilibrium is at least $v_2 - \Theta(1/k)- \Theta(\epsilon k^2)$, where $v_2 \geq 0$ is the second-highest valuation. Our results establish the first polynomial convergence rates on...

arXiv CS 5d ago

A Temporal Spatial Minimax Rate for Smoothly-Varying Distributions in Wasserstein Space

arXiv:2606.07325v1 Announce Type: cross Abstract: We study the minimax rate of estimating a future value $\mu_{t_n+h}$ of a curve $t\mapsto\mu_t$ in the $2$-Wasserstein space $\mathcal{P}_2(\mathbb{R}^d)$ from finitely many noisy snapshots of its past, under an adiabatic bound $\|\nabla_t^k v\|\le\varepsilon$ on the $k$-th covariant derivative of the velocity field. Our central result is a unified temporal-spatial minimax lower bound: over regular, locally transport-rich subclasses, every...

arXiv CS 2d ago

Do Transformers Need Three Projections? Systematic Study of QKV Variants

Computer Science > Machine Learning [Submitted on 1 Jun 2026] Title: Do Transformers Need Three Projections? Systematic Study of QKV Variants Abstract: Transformers have become the standard solution for various AI tasks, with the query, key, and value (QKV) attention formulation playing a central role.

Hacker News 5d ago

A Kronecker algorithm for locally closed sets over a perfect field

arXiv:2512.14888v2 Announce Type: replace-cross Abstract: We develop a probabilistic algorithm of Kronecker type for computing a Kronecker representation of a zero-dimensional linear section of an algebraic variety $V$ defined over a perfect field $k$. The variety $V$ is the Zariski closure of the set of common zeros $\{F_1=0,\ldots,F_r=0,G\not=0\}$ of multivariate polynomials $F_1,\ldots,F_r\in k[X_1,\ldots,X_n]$ outside a prescribed hypersurface $\{G=0\}$. We assume that $F_1,\ldots,F_r$...

arXiv CS 1d ago

Almost balanced ordered biclique covering of graphs

arXiv:2606.08506v1 Announce Type: cross Abstract: Let $f(n,k)$ be the minimum size of a collection of bicliques such that (i) every edge of the complete graph $K_n$ is covered by at least one and at most $k$ bicliques in the collection, and (ii) for each edge $\{u,v\}$, the number of bicliques in which $u$ appears in the first class and $v$ in the second class differs by at most one from the number of bicliques in which $u$ appears in the second class and $v$ in the first class. For $k=1$,...

arXiv CS 1d ago

Decomposing tournaments into comparability graphs

Announce Type: cross Abstract: In this note, we introduce the \emph{partial order decomposition number} of a digraph $D$, denoted $pod(D)$, defined as the minimum integer $k$ such that $A(D)=A(P_1)\cup\cdots\cup A(P_k)$, where $P_1,\ldots,P_k$ are partial orders on $V(D)$. We prove that $\dic(D)\le \diomega(D)^{pod(D)}$ for every digraph $D$. In particular, every class of digraphs with bounded $pod$ is polynomially $\dic$-bounded. We apply this to tournaments, showing that if $\mathcal C$ is...

arXiv CS 1d ago

BOKBO (Best of K Bad Options): Calibrated Abstention for VLA Policies

arXiv:2605.30660v1 Announce Type: new Abstract: Test-time scaling methods for vision-language-action (VLA) policies, such as RoboMonkey, SEAL, MG-Select, and V-GPS, sample K candidate action chunks at inference and execute the verifier-best. When all K candidates are unsafe, the system executes a violating action with no warning. We propose BOKBO, the first conformal abstention layer for K-sample VLA inference, providing finite-sample distribution-free guarantees on executed-violation rate.

arXiv CS 9d ago

CART: Context-Anchored Recurrent Transformer -- A Parameter-Efficient Architecture with Learned Stability

arXiv:2606.01495v2 Announce Type: replace Abstract: We present CART (Context-Anchored Recurrent Transformer), a parameter-efficient language model that reuses a single shared core block R times across depth. Unlike prior looped transformers that recompute key-value tensors at every iteration, CART computes K and V once from a multi-layer prelude and has the recurrent core cross-attend to those frozen tensors via multi-head latent attention. A learned Linear Time-Invariant (LTI) gate keeps...

arXiv CS 6d ago