Home Knowledge Base pass@$K$

pass@$K$

No mentions found

This entity hasn't been tracked yet, or Iris is still building its knowledge base.

Related Articles from SNS

Cast a Wider Net: Coordinated Pass@K Policy Optimization for Code Reasoning

arXiv:2605.27000v2 Announce Type: replace Abstract: Repeated sampling with a verifier is the standard way to allocate test-time compute for code generation, with pass@$K$ as the canonical metric. Yet the standard policy class draws $K$ independent samples from a single answer distribution, so attempts often collapse onto near-duplicate reasoning paths and waste the budget on redundant rollouts. This failure is costly in competitive programming, where many problems admit multiple distinct...

arXiv CS 8d ago

Teacher-Free Self-Training Amplifies but Does Not Compound: A Pass@$K$ Crossover on a Free-Verifier Domain

Announce Type: new Abstract: When a language model trains on its own verified outputs, does it acquire capability beyond its base, or merely get better at expressing capability the base already had? We make the question decidable with a teacher-free "constellation" -- a generator, a learned critic, and a free exact verifier -- on a FlashFill-style "trapdoor" DSL, where verified (problem, solution) pairs are cheap to synthesize, hard to invert, and free to check exactly. Everything runs on...

arXiv CS 1d ago

Exploiting Verification-Generation Gap: Test-Time Reinforcement Learning with Confidence-Conditioned Verification

Announce Type: new Abstract: Test-time reinforcement learning has emerged as a promising paradigm for enhancing the complex reasoning abilities of large language models in a completely label-free manner. Despite existing studies focusing on Pass@1 performance, optimizing Pass@k remains under-explored yet critical in label-free settings, which measures generation coverage for sustained exploration. Optimizing Pass@k in label-free setting is highly non-trivial, as directly applying the Pass@k...

arXiv CS 7d ago

More Bang for the Buck: Improving the Inference of Large Language Models at a Fixed Budget using Reset and Discard (ReD)

arXiv:2601.21522v2 Announce Type: replace Abstract: The performance of large language models (LLMs) on verifiable tasks is usually measured by pass@k, the probability of answering a question correctly at least once in k trials. At a fixed budget, a more suitable metric is coverage@cost, the average number of unique questions answered as a function of the total number of attempts. We connect the two metrics and show that the empirically-observed power-law behavior in pass@k leads to a...

arXiv CS 1d ago

On Advantage Estimates for Max@K Policy Gradients

arXiv:2606.06080v1 Announce Type: new Abstract: Reinforcement learning with verifiable rewards is widely used for post-training reasoning models, but sparse outcome rewards make exploration difficult. A complementary approach is to optimize inference-time objectives such as pass@K and max@K directly, yet existing policy-gradient estimators for these objectives use different signals, baselines, and normalizations, making their relationships unclear. We study this issue through baseline design...

arXiv CS 5d ago

Retry Policy Gradients in Continuous Action Spaces

arXiv:2606.05888v1 Announce Type: new Abstract: Retry-based objectives such as pass@K and max@K optimize the best return obtained from multiple sampled trajectories, and recent work has shown that they can promote exploration without explicit exploration bonuses. In discrete action spaces, ReMax was shown to do so by adapting to return uncertainty. In this work, we introduce pathwise derivative estimators for retry objectives and use them to extend ReMax to continuous action spaces.

arXiv CS 5d ago

Evaluation of LLMs for Mathematical Formalization in Lean

arXiv:2606.05632v1 Announce Type: new Abstract: Within the past few years, the ability of Large Language Models (LLMs) to generate formal mathematical proofs has improved drastically. We provide a comparison of various LLMs' effectiveness in producing formal proofs in Lean 4 with the goal of assisting those seeking to use LLMs to support their own projects. We utilize both pass@$k$ and refine@$k$ metrics as the benchmark for our comparison and evaluate on subsets of both miniF2F and miniCTX...

arXiv CS 5d ago

Finite-Time Regret Analysis of Retry-Aware Bandits

Announce Type: replace Abstract: We study a stochastic bandit algorithm motivated by retry-aware objectives that value the best outcome among multiple attempts, such as pass@$k$ and max@$k$. Given a posterior over arm values, ReMax chooses a sampling distribution that maximizes the posterior expected maximum reward over $M$ virtual draws. Although this objective was introduced in reinforcement learning as an exploration mechanism under uncertainty, its regret properties in bandit problems...

arXiv CS 8d ago

SWE-IF: Aligning Code Evaluation with Human Preference

arXiv:2510.07315v2 Announce Type: replace Abstract: Large Language Models (LLMs) have catalyzed vibe coding, where users leverage LLMs to generate and iteratively refine code through natural language interactions until it passes their vibe check. Vibe check reflects human preference and goes beyond functionality: the solution should feel right, read cleanly, preserve intent, and remain correct. However, current code evaluation remains anchored to pass@k and captures only functional...

arXiv CS 2d ago

SimulCost: A Cost-Aware Benchmark and Toolkit for Automating Physics Simulations with LLMs

Announce Type: replace Abstract: Evaluating LLM agents for scientific tasks has focused on token costs while ignoring tool-use costs like simulation time and experimental resources. As a result, metrics like pass@k become impractical under realistic budget constraints. To address this gap, we introduce SimulCost, the first benchmark targeting cost-sensitive parameter tuning in physics simulations.

arXiv Physics 9d ago