Margin-Adaptive Direct Preference Optimization
Margin Adaptive DPO: Leveraging Reward Model for Granular Control in Preference Optimization
arXiv:2510.05342v2. Abstract: Direct Preference Optimization (DPO) has emerged as a simple and effective method for aligning large language models. However, its reliance on a fixed temperature parameter leads to suboptimal training on diverse preference data, causing overfitting on easy examples and under-learning from informative ones. Recent methods have emerged to counter this.
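To make the "fixed temperature parameter" concrete, here is a minimal sketch of the standard DPO loss for a single preference pair. It illustrates only the baseline the abstract criticizes, not the margin-adaptive method of the paper. The function name `dpo_loss`, its argument names, and the default `beta=0.1` are assumptions chosen for illustration.

```python
import math


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Standard DPO loss for one preference pair with a fixed beta.

    Inputs are summed log-probabilities of the chosen and rejected
    responses under the policy and the frozen reference model.
    All names here are hypothetical, picked for this sketch.
    """
    # Implicit reward margin: how much more the policy favors the chosen
    # response over the rejected one, relative to the reference model.
    margin = ((policy_chosen_logp - ref_chosen_logp)
              - (policy_rejected_logp - ref_rejected_logp))
    # -log(sigmoid(beta * margin)), written as a numerically stable softplus.
    z = -beta * margin
    return max(z, 0.0) + math.log1p(math.exp(-abs(z)))


# The same beta scales every pair, whether its margin is large or small.
easy_pair = dpo_loss(-10.0, -30.0, -12.0, -20.0)   # margin = 12
hard_pair = dpo_loss(-15.0, -15.5, -15.0, -15.2)   # margin = 0.3
print(easy_pair, hard_pair)
```

Because `beta` is a single constant, every pair is scaled identically regardless of how large or small its margin is. That is the limitation the abstract attributes to a fixed temperature.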