Standard Reinforcement Learning with Verifiable Rewards
No mentions found
This entity hasn't been tracked yet, or Iris is still building its knowledge base.
Related Articles from SNS
EchoRL: Reinforcement Learning via Rollout Echoing
arXiv:2605.31228v1 Announce Type: new Abstract: Reinforcement Learning with Verifiable Rewards is an effective route for post-training to strengthen the reasoning capability of large language models. However, as training proceeds, the learning signal can collapse thus makes the training gain become marginal and ineffective. Specifically, a growing fraction of prompts' rollouts become advantage-degenerated: all the self-generated rollouts show verified-success, making the standard deviation...
Good Reasoning Makes Good Demonstrations: Implicit Reasoning Quality Supervision via In-Context Reinforcement Learning
arXiv:2603.09803v2 Announce Type: replace Abstract: Reinforcement Learning with Verifiable Rewards (RLVR) improves reasoning in large language models but treats all correct solutions equally, potentially reinforcing flawed traces that arrive at correct answers by chance. We observe that \emph{better reasoning makes better demonstrations}: high-quality solutions serve as more effective in-context examples than low-quality ones. We term this teaching ability \textbf{Demonstration Utility}, and...
Selective-Advantage Entropy-Adaptive Horizon GRPO: Asymmetric Token-Level Discounting for Efficient Reinforcement Learning of Language Models
Announce Type: new Abstract: Group Relative Policy Optimisation (GRPO) has emerged as an effective reinforcement-learning algorithm for aligning language models on reasoning tasks, but it treats every token position and every sampled rollout symmetrically. We introduce two complementary extensions: (i) Adaptive-Horizon GRPO (AH-GRPO), which weights each token's policy gradient using a cumulative entropy-based discount that reduces the effective horizon when the model is uncertain, and (ii)...
sGPO: Trading Inference FLOPs for Training Efficiency in RLVR
Announce Type: new Abstract: Standard Reinforcement Learning with Verifiable Rewards (RLVR) training allocates a fixed rollout budget to every query, without regard for what each query's difficulty means for the current policy. This leads to two symmetric failure modes: easy queries produce near-zero advantage because the policy already solves them, while unsolvable queries produce no signal because the policy never solves them. Both regimes waste training FLOPs without contributing to a...
Read the Trace, Steer the Path: Trajectory-Aware Reinforcement Learning for Diffusion Language Models
arXiv:2606.04396v1 Announce Type: new Abstract: Diffusion large language models (dLLMs) generate responses by iteratively unmasking and revising many positions in parallel. This process leaves a rich denoising trace depicting which tokens become confident, which remain unstable, and when commitments form. Existing dLLM reinforcement learning methods use this signal only weakly.
Rollout-Level Advantage-Prioritized Experience Replay for GRPO
Announce Type: new Abstract: Reinforcement learning from verifiable rewards with GRPO is a standard approach for post-training reasoning LLMs. It remains sample inefficient. Each rollout is used for a single gradient update and then discarded.
Rollout-Level Advantage-Prioritized Experience Replay for GRPO
arXiv:2606.04560v2 Announce Type: replace Abstract: Reinforcement learning from verifiable rewards with GRPO is a standard approach for post-training reasoning LLMs. It remains sample inefficient. Each rollout is used for a single gradient update and then discarded.
MDP-GRPO: Stabilized Group Relative Policy Optimization for Multi-Constraint Instruction Following
arXiv:2606.06058v1 Announce Type: new Abstract: Reinforcement learning with verifiable rewards is ideal for multi-constraint instruction following, yet standard group-relative policy optimization (GRPO) becomes unstable under discrete, low-dispersion rewards, where within-group reward distributions are frequently homogeneous. We identify and formalize three pathologies of z-score group normalization in this regime: low-variance amplification, mean-centering blindness, and zero-variance...
GeoMin: Data-Efficient Semi-Supervised RLVR via Geometric Distribution Modeling
arXiv:2606.04516v1 Announce Type: new Abstract: Reinforcement learning with verifiable rewards (RLVR) significantly advances LLM reasoning, yet it faces a dilemma: standard supervised scaling is throttled by high annotation costs, while unsupervised alternatives suffer from severe model collapse. Recent semi-supervised RLVR methods address this by using a small labeled set to guide unlabeled data, achieving a promising trade-off between training efficacy and annotation cost. However, they...
Learning to Solve, Forgetting to Retain: Correct-Set Turnover in RLVR
arXiv:2606.03087v1 Announce Type: new Abstract: Reinforcement learning with verifiable rewards (RLVR) improves the ability of large language model, yet headline accuracy gains often conceal a hidden cost: previously solved problems quietly become unsolvable as training proceeds. We frame this phenomenon as \emph{correct-set turnover}, representing the coupled dynamics of solution acquisition and regression over the mastered set. Under this view, retention becomes an explicit optimization...