Reinforcement Learning with Verifiable Rewards

Related Articles from SNS

CapRL++: Unified Reinforcement Learning with Verifiable Rewards for Dense Image and Video Captioning

arXiv:2606.09393v1 Announce Type: new Abstract: Image and video captioning are fundamental tasks that bridge the visual and linguistic domains, playing a critical role in pre-training Large Vision-Language Models (LVLMs). Current state-of-the-art captioning models are typically trained with Supervised Fine-Tuning (SFT), a paradigm that relies on expensive, non-scalable annotations and often causes models to memorize specific ground-truth answers, limiting their generality and ability to...

arXiv CS 1d ago

Revisiting Reinforcement Learning with Verifiable Rewards from a Contrastive Perspective

arXiv:2605.12969v3 Announce Type: replace Abstract: Group Relative Policy Optimization (GRPO) is one of the most widely adopted RLVR algorithms for post-training large language models on reasoning tasks. We first show that GRPO admits an equivalent discriminative reformulation, in which policy optimization maximizes the expected score gap between verified positive and negative rollouts. This reformulation reveals two objective-level limitations: likelihood-misaligned surrogate scores, in...

arXiv CS 8d ago

GRAIL: Gradient-Reweighted Advantages for Reinforcement Learning with Verifiable Rewards

Announce Type: new Abstract: Reinforcement learning with verifiable rewards (e.g. GRPO) is now a common way to improve mathematical reasoning in Large Language Models (LLMs). However, current methods usually broadcast one sequence-level advantage to all tokens, or use costly process reward models (PRMs) for step-level supervision. Uniform advantage distribution assumes that all tokens contribute equally to the final reward.

arXiv CS 6d ago
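
The uniform credit assignment that the GRAIL abstract above describes (one sequence-level advantage broadcast to every token) can be illustrated with a minimal sketch. It assumes the commonly used group-relative normalization (reward minus group mean, divided by group standard deviation) and binary verifier rewards. The function names, the `eps` constant, and the use of the population standard deviation are illustrative assumptions, not taken from any of the listed papers.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize each rollout's reward against its own group.

    Assumption: mean/std normalization over the group; eps avoids
    division by zero when all rewards in the group are identical.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def broadcast_to_tokens(advantages, token_counts):
    """Copy each sequence-level advantage onto every token of that rollout."""
    return [[a] * n for a, n in zip(advantages, token_counts)]


# Hypothetical group of four rollouts for one prompt:
# binary verifier rewards and the number of tokens in each response.
rewards = [1.0, 0.0, 0.0, 1.0]
token_counts = [3, 2, 4, 1]

seq_adv = group_relative_advantages(rewards)
token_adv = broadcast_to_tokens(seq_adv, token_counts)

for r, adv in zip(rewards, token_adv):
    print(r, adv)
```

Every token in a rollout receives the same value, so a correct response credits all of its tokens equally regardless of which ones mattered for the final answer. That is the assumption the abstract calls out.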

Policy-Conditioned Counterfactual Credit for Verifiable Reinforcement Learning of Long-Horizon Language Agents

arXiv:2606.05263v1 Announce Type: new Abstract: Reinforcement learning with verifiable rewards improves reasoning and tool use, yet long-horizon language agents still learn unsupported evidence chains, belief drift, and shortcut actions that satisfy terminal checks. Existing process rewards are mostly correlational: they reward retrieval-, reflection-, or verification-like steps without estimating whether the step contributes to final verified success under a specified intervention. We...

arXiv CS 5d ago

Synthesize and Reward -- Reinforcement Learning for Multi-Step Tool Use in Live Environments

Announce Type: new Abstract: Training LLMs to orchestrate multi-step tool calls is held back by three coupled obstacles: realistic stateful execution environments are costly to build, synthetic training queries are often detached from the server's actual state (so the generated tool calls fail to execute), and recall-based RL rewards incentivize verbose tool-calling patterns. We present PROVE (Programmatic Rewards On Verified Environments), a framework with three contributions: (1) a library...

arXiv CS 7d ago

Policy Improvement Reinforcement Learning

Announce Type: replace Abstract: Reinforcement Learning with Verifiable Rewards (RLVR) has become a central post-training paradigm for improving the reasoning capabilities of large language models. Yet existing methods share a common blind spot: they optimize policies based on instantaneous group-level or batch-level statistics without ever verifying whether the resulting update actually improved the model. This open-loop design -- updating in isolation at each step, guided only by...

arXiv CS 6d ago

Leveraging Error Diversity in Group Rollouts for Reinforcement Learning

Announce Type: replace Abstract: Reinforcement Learning from Verifiable Rewards (RLVR) typically samples multiple responses per prompt and assigns binary rewards based on individual correctness, yet the collective structure of the group output, specifically the distribution of errors, is largely discarded. We identify this as a missed opportunity: empirical analysis reveals that error diversity within a group is a strong predictor of training success, with problems eliciting diverse wrong...

arXiv CS 2d ago

RLVE: Scaling Up Reinforcement Learning for Language Models with Adaptive Verifiable Environments

arXiv:2511.07317v2 Announce Type: replace Abstract: We introduce Reinforcement Learning (RL) with Adaptive Verifiable Environments (RLVE), an approach using verifiable environments that procedurally generate problems and provide algorithmically verifiable rewards, to scale up RL for language models (LMs). RLVE enables each verifiable environment to dynamically adapt its problem difficulty distribution to the policy model's capabilities as training progresses. In contrast, static data...

arXiv CS 1d ago

Automating Formal Verification with Reinforcement Learning and Recursive Inference

arXiv:2605.30914v1 Announce Type: new Abstract: Automated formal verification remains challenging for large language models because data for proof assistants and verification-aware languages is scarce, and correctness depends on satisfying precise machine-checkable specifications rather than producing plausible code. This thesis studies how verifier environments can improve LLM generation of verified programs and proofs through reinforcement learning from verifiable rewards (RLVR) and...

arXiv CS 9d ago