Bayesian Non-Negative Reward Model

Related Articles from SNS

Mitigating Reward Hacking in RLHF via Bayesian Non-negative Reward Modeling

arXiv:2602.10623v2. Abstract: Reward models learned from human preferences are central to aligning large language models (LLMs) via reinforcement learning from human feedback, yet they are often vulnerable to reward hacking due to noisy annotations and systematic biases such as response length or style. We propose Bayesian Non-Negative Reward Model (BNRM), a principled reward modeling framework that integrates non-negative factor analysis into Bradley-Terry (BT)...

arXiv CS 8d ago