ERPD
No mentions found
This entity hasn't been tracked yet, or Iris is still building its knowledge base.
Related Articles from SNS
Extreme Region Policy Distillation
arXiv:2605.25582v2 Announce Type: replace Abstract: Reinforcement learning for large language models faces a fundamental trade-off between sample efficiency and asymptotic performance: strictly on-policy methods discard trajectories after a single update, while off-policy reuse introduces distribution mismatch that existing trust-region techniques mitigate primarily by enforcing conservative optimization, often leaving rich training signals underutilized. To investigate this, we perform...