MaxPO

No mentions found

This entity hasn't been tracked yet, or Iris is still building its knowledge base.

Related Articles from SNS

On Advantage Estimates for Max@K Policy Gradients

arXiv:2606.06080v1. Abstract: Reinforcement learning with verifiable rewards is widely used for post-training reasoning models, but sparse outcome rewards make exploration difficult. A complementary approach is to optimize inference-time objectives such as pass@K and max@K directly, yet existing policy-gradient estimators for these objectives use different signals, baselines, and normalizations, making their relationships unclear. We study this issue through baseline design...

arXiv CS 5d ago