Annealed Softmax Greedy

Related Articles from SNS

Annealed Softmax Greedy in Many-Armed Bayesian Bandits

Abstract: Reinforcement learning with verifiable rewards (RLVR) and group-based policy optimization methods such as GRPO update a stochastic policy by sampling multiple completions per prompt and increasing the policy's probability on those with higher reward, regularized by a KL penalty toward a reference policy. These updates do not include explicit mechanisms that track epistemic uncertainty. This paper studies a stylized explanation for why such uncertainty-agnostic...

arXiv CS 9d ago
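
The group-based update described in the abstract can be illustrated with a minimal sketch. It is not taken from the paper: the function names, the mean/standard-deviation normalization, the plain advantage-weighted log-probability surrogate, and the sample-based KL estimate are all simplifying assumptions chosen for illustration, and the clipping used in practical GRPO implementations is omitted.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Score each completion relative to the other samples for the same prompt.

    Assumption: rewards are normalized by the group's mean and population
    standard deviation; eps avoids division by zero when all rewards tie.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def kl_penalized_surrogate(logp_policy, logp_ref, advantages, beta=0.1):
    """Simplified surrogate: advantage-weighted log-probs minus a KL penalty.

    Assumption: KL(policy || reference) is estimated as the average of
    log pi(y) - log pi_ref(y) over completions sampled from the policy.
    beta is a hypothetical penalty weight.
    """
    weighted = mean(a * lp for a, lp in zip(advantages, logp_policy))
    kl_estimate = mean(lp - lr for lp, lr in zip(logp_policy, logp_ref))
    return weighted - beta * kl_estimate


if __name__ == "__main__":
    # Four completions for one prompt, with binary verifiable rewards.
    rewards = [1.0, 0.0, 0.0, 1.0]
    advantages = group_relative_advantages(rewards)
    print(advantages)
```

Maximizing this surrogate raises the log-probability of completions that scored above the group average and lowers it for those below, while the penalty term discourages drift from the reference policy. Nothing in the computation tracks how uncertain the reward estimates are, which is the property the abstract highlights.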