Reward-Guided Policy Optimization

Related Articles from SNS

Adaptive Loss Balancing for Noise-Robust GRPO in Generative Recommendation

arXiv:2606.08480v1 Announce Type: new Abstract: Reinforcement learning (RL) presents a promising avenue for enhancing generative recommendation beyond supervised imitation, leveraging reward signals to guide policy improvement. However, its efficacy is critically contingent on the trustworthiness of the reward model for the samples it evaluates. In practice, production rankers, the widely adopted reward models, are trained on exposure-biased logs, leading to sample-dependent inaccuracies...

arXiv CS 1d ago

Reinforcing Few-step Generators via Reward-Tilted Distribution Matching

arXiv:2605.26108v3 Announce Type: replace Abstract: Recent advances in few-step diffusion distillation have enabled efficient image generation, yet aligning these models with human preferences remains challenging. We propose Reward-Tilted Distribution Matching Distillation (RTDMD), a two-stage framework that unifies distribution matching distillation with reward-guided reinforcement learning for few-step flow generators. We show that minimizing the KL divergence to a reward-tilted teacher...

arXiv CS 1d ago

ImageAuditor: Membership Inference Attack against Image-based Retrieval-Augmented Generation

Announce Type: new Abstract: Image-based Retrieval-Augmented Generation (IRAG) conditions a frozen generator on reference images retrieved from an external database, supporting both text-to-image (T2I) and question answering (Q&A) tasks. Because these databases are opaque and web-scraped, copyright holders need ways to audit whether specific images appear in them. While prior work employs membership inference attacks (MIAs) to audit uni-modal, text-based RAG, they fail to transfer to...

arXiv CS 7d ago

Self-Imitated Diffusion Policy for Efficient and Robust Visual Navigation

arXiv:2601.22965v2 Announce Type: replace Abstract: Diffusion policies (DP) have demonstrated significant potential in visual navigation by capturing diverse multi-modal trajectory distributions. However, standard imitation learning (IL), which most DP methods rely on for training, often inherits sub-optimality and redundancy from expert demonstrations, thereby necessitating a computationally intensive "generate-then-filter" pipeline that relies on auxiliary selectors during inference. To...

arXiv CS 8d ago