Policy Optimization (PO)

Related Articles from SNS

Teaching the Way, Not the Answer: Privileged Tutoring Distillation for Multimodal Policy Optimization

arXiv:2606.07000v1 (announce type: new). Abstract: Recent post-training methods, particularly Reinforcement Learning with Verifiable Rewards (RLVR), have significantly enhanced the reasoning ability of Large Vision-Language Models (LVLMs). However, the sparse nature of verifiable rewards provides little token-level supervision for failed rollouts, often leading to inefficient exploration in complex multimodal reasoning tasks.

arXiv CS 2d ago

Revisiting Zeroth-Order Hessian Approximation: A Single-Step Policy Optimization Lens

Announce type: new. Abstract: Accurate Zeroth-Order (ZO) Hessian estimation is a cornerstone of derivative-free methods, essential for tasks such as bilevel optimization, Bayesian inference, and uncertainty quantification. However, obtaining a complete suite of low-variance estimators for the Hessian and its inverse in high-dimensional settings remains a significant challenge. To address this, we propose a unified framework that reinterprets ZO Hessian approximation through the lens of...

arXiv CS 9d ago

Smaller Models are Natural Explorers for Policy-Level Diversity in GRPO

arXiv:2605.30789v1 (announce type: new). Abstract: We identify a new dimension for enhancing rollout diversity in Group Relative Policy Optimization (GRPO) for LLMs. While GRPO relies on diverse rollouts, prevailing strategies primarily increase diversity by injecting more token-level randomness, which may introduce step-wise noise and lead to incoherent trajectories. We uncover that smaller models within the same model family inherently exhibit higher policy-level diversity, indicated by their...

arXiv CS 9d ago

Smaller Models are Natural Explorers for Policy-Level Diversity in GRPO

arXiv:2605.30789v2 (announce type: replace). Abstract: We identify a new dimension for enhancing rollout diversity in Group Relative Policy Optimization (GRPO) for LLMs. While GRPO relies on diverse rollouts, prevailing strategies primarily increase diversity by injecting more token-level randomness, which may introduce step-wise noise and lead to incoherent trajectories. We uncover that smaller models within the same model family inherently exhibit higher policy-level diversity, indicated by...

arXiv CS 7d ago