ISPO
No mentions found
This entity hasn't been tracked yet, or Iris is still building its knowledge base.
Related Articles from SNS
Momentum for Reasoning: Dense Intrinsic Signals in Policy Optimization
Announce Type: new Abstract: Reinforcement learning with verifiable rewards (RLVR) has emerged as a powerful paradigm for eliciting long-chain reasoning in large language models. However, existing methods based on Group Relative Policy Optimization (GRPO) rely on a binary outcome reward, which induces two structural failure modes: Zero-Advantage Collapse, in which all rollouts in a group share the same outcome and the gradient vanishes, and Hallucinated Certainty, in which the model becomes...