Standard Reinforcement Learning with Verifiable Rewards

Related Articles from SNS

EchoRL: Reinforcement Learning via Rollout Echoing

arXiv:2605.31228v1 Announce Type: new Abstract: Reinforcement Learning with Verifiable Rewards is an effective route for post-training to strengthen the reasoning capability of large language models. However, as training proceeds, the learning signal can collapse, leaving further training gains marginal. Specifically, a growing fraction of prompts' rollouts become advantage-degenerated: all the self-generated rollouts are verified successes, making the standard deviation...

arXiv CS 9d ago

Good Reasoning Makes Good Demonstrations: Implicit Reasoning Quality Supervision via In-Context Reinforcement Learning

arXiv:2603.09803v2 Announce Type: replace Abstract: Reinforcement Learning with Verifiable Rewards (RLVR) improves reasoning in large language models but treats all correct solutions equally, potentially reinforcing flawed traces that arrive at correct answers by chance. We observe that better reasoning makes better demonstrations: high-quality solutions serve as more effective in-context examples than low-quality ones. We term this teaching ability Demonstration Utility, and...

arXiv CS 6d ago

Selective-Advantage Entropy-Adaptive Horizon GRPO: Asymmetric Token-Level Discounting for Efficient Reinforcement Learning of Language Models

Announce Type: new Abstract: Group Relative Policy Optimisation (GRPO) has emerged as an effective reinforcement-learning algorithm for aligning language models on reasoning tasks, but it treats every token position and every sampled rollout symmetrically. We introduce two complementary extensions: (i) Adaptive-Horizon GRPO (AH-GRPO), which weights each token's policy gradient using a cumulative entropy-based discount that reduces the effective horizon when the model is uncertain, and (ii)...

arXiv CS 5d ago

sGPO: Trading Inference FLOPs for Training Efficiency in RLVR

Announce Type: new Abstract: Standard Reinforcement Learning with Verifiable Rewards (RLVR) training allocates a fixed rollout budget to every query, without regard for what each query's difficulty means for the current policy. This leads to two symmetric failure modes: easy queries produce near-zero advantage because the policy already solves them, while unsolvable queries produce no signal because the policy never solves them. Both regimes waste training FLOPs without contributing to a...

arXiv CS 1d ago
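
The two failure modes described in the sGPO abstract can be seen in a small sketch. It assumes binary verifier rewards and a GRPO-style group z-score advantage; the function names, the `eps` constant, and the example groups are illustrative assumptions, not taken from the paper.

```python
from statistics import mean, pstdev


def group_advantages(rewards, eps=1e-6):
    """Z-score each rollout's reward within its own group (assumed GRPO-style)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def has_signal(rewards):
    """A group carries a learning signal only if its rewards are not all equal."""
    return len(set(rewards)) > 1


# Hypothetical rollout groups: 1 = verified success, 0 = failure.
groups = {
    "easy": [1, 1, 1, 1],        # policy already solves it
    "unsolvable": [0, 0, 0, 0],  # policy never solves it
    "mixed": [1, 0, 1, 0],       # some rollouts succeed, some fail
}

# Groups whose rollouts cost compute but yield all-zero advantages.
wasted = [name for name, r in groups.items() if not has_signal(r)]
```

Under these assumptions, both the easy and the unsolvable group produce advantages of exactly zero, so only the mixed group contributes a gradient.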

Read the Trace, Steer the Path: Trajectory-Aware Reinforcement Learning for Diffusion Language Models

arXiv:2606.04396v1 Announce Type: new Abstract: Diffusion large language models (dLLMs) generate responses by iteratively unmasking and revising many positions in parallel. This process leaves a rich denoising trace depicting which tokens become confident, which remain unstable, and when commitments form. Existing dLLM reinforcement learning methods use this signal only weakly.

arXiv CS 6d ago

Rollout-Level Advantage-Prioritized Experience Replay for GRPO

arXiv:2606.04560v2 Announce Type: replace Abstract: Reinforcement learning from verifiable rewards with GRPO is a standard approach for post-training reasoning LLMs. It remains sample inefficient. Each rollout is used for a single gradient update and then discarded.

arXiv CS 5d ago

MDP-GRPO: Stabilized Group Relative Policy Optimization for Multi-Constraint Instruction Following

arXiv:2606.06058v1 Announce Type: new Abstract: Reinforcement learning with verifiable rewards is ideal for multi-constraint instruction following, yet standard group-relative policy optimization (GRPO) becomes unstable under discrete, low-dispersion rewards, where within-group reward distributions are frequently homogeneous. We identify and formalize three pathologies of z-score group normalization in this regime: low-variance amplification, mean-centering blindness, and zero-variance...

arXiv CS 5d ago
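
One way to read "low-variance amplification" in the MDP-GRPO abstract is that z-scoring discards the scale of within-group reward differences. The sketch below illustrates that reading only. The reward values (fraction of constraints satisfied), the function name, and `eps` are assumptions for illustration, and the paper's formal definitions may differ.

```python
from statistics import mean, pstdev


def zscore_advantages(rewards, eps=1e-6):
    """Standard z-score normalization of rewards within one rollout group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical groups: three rollouts tie, one scores higher.
large_gap = [0.0, 0.0, 0.0, 1.0]     # the best rollout is far better
small_gap = [0.75, 0.75, 0.75, 1.0]  # the best rollout is only slightly better

adv_large = zscore_advantages(large_gap)
adv_small = zscore_advantages(small_gap)

# After normalization the two groups are nearly indistinguishable:
# a 0.25 reward gap receives about the same advantage as a 1.0 gap.
max_diff = max(abs(a - b) for a, b in zip(adv_large, adv_small))
```

In both groups the top rollout receives an advantage close to the square root of 3, so a marginal improvement is weighted about as heavily as a decisive one.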

GeoMin: Data-Efficient Semi-Supervised RLVR via Geometric Distribution Modeling

arXiv:2606.04516v1 Announce Type: new Abstract: Reinforcement learning with verifiable rewards (RLVR) significantly advances LLM reasoning, yet it faces a dilemma: standard supervised scaling is throttled by high annotation costs, while unsupervised alternatives suffer from severe model collapse. Recent semi-supervised RLVR methods address this by using a small labeled set to guide unlabeled data, achieving a promising trade-off between training efficacy and annotation cost. However, they...

arXiv CS 6d ago

Learning to Solve, Forgetting to Retain: Correct-Set Turnover in RLVR

arXiv:2606.03087v1 Announce Type: new Abstract: Reinforcement learning with verifiable rewards (RLVR) improves the abilities of large language models, yet headline accuracy gains often conceal a hidden cost: previously solved problems quietly become unsolvable as training proceeds. We frame this phenomenon as correct-set turnover, representing the coupled dynamics of solution acquisition and regression over the mastered set. Under this view, retention becomes an explicit optimization...

arXiv CS 7d ago