Stabilizing Policy Optimization via Logits Convexity

arXiv CS, Tuesday 02 June 2026, 04:00 UTC. By Hongzhan Chen, Tao Yang, Yuhua Zhu, Shiping Gao, Xiaojun Quan, Ting Yao

Key Points

arXiv:2603.00963v2

Abstract: While reinforcement learning (RL) has been central to the recent success of large language models (LLMs), RL optimization is notoriously unstable, especially when compared to supervised fine-tuning (SFT). In this work, we investigate the stability gap between SFT and RL from a gradient-based perspective, and show that the convexity of the SFT loss with respect to model logits plays a key role in enabling stable training. Our theoretical analysis demonstrates that this property induces favorable gradient directionality during optimization. In contrast, Proximal Policy Optimization (PPO), a widely adopted policy gradient algorithm utilizing a clipped surrogate objective, lacks this stabilizing property. Motivated by this observation, we propose Logits Convex Optimization (LCO), a simple yet effective policy optimization framework that aligns the learned policy with an optimal target derived from the original RL objective, thereby emulating the stabilizing effects of logits-level convexity. Extensive experiments across multiple model families show that our LCO framework consistently improves training stability and outperforms conventional RL methods on a broad range of benchmarks.
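The convexity claim in the abstract can be checked numerically on a toy example. The sketch below is an illustration only, not the paper's analysis or the LCO method. It assumes a single categorical distribution over three tokens, and the function names (`sft_hessian`, `ratio_surrogate_hessian`) are hypothetical. It compares the Hessian, with respect to the logits, of the cross-entropy (SFT) loss against that of the unclipped importance-ratio term that PPO's surrogate reduces to inside the clipping range. The first is positive semidefinite, which is what convexity in the logits means; the second generally has a negative eigenvalue.

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax over a 1-D logit vector."""
    z = z - np.max(z)
    e = np.exp(z)
    return e / e.sum()

def sft_hessian(z):
    """Hessian of the cross-entropy loss -log softmax(z)[y] w.r.t. z.

    It equals diag(p) - p p^T and does not depend on the label y.
    """
    p = softmax(z)
    return np.diag(p) - np.outer(p, p)

def ratio_surrogate_hessian(z, a, advantage, p_old_a):
    """Hessian w.r.t. z of the loss -advantage * softmax(z)[a] / p_old_a.

    This is the unclipped ratio term only (an assumption made for the
    illustration); clipping and any KL or entropy terms are left out.
    """
    p = softmax(z)
    e_a = np.zeros_like(p)
    e_a[a] = 1.0
    d = e_a - p
    # Hessian of p[a]: p[a] * (d d^T - (diag(p) - p p^T))
    hess_p_a = p[a] * (np.outer(d, d) - (np.diag(p) - np.outer(p, p)))
    return -(advantage / p_old_a) * hess_p_a

if __name__ == "__main__":
    z = np.zeros(3)  # uniform policy over three tokens
    print("SFT min eigenvalue:  ", np.linalg.eigvalsh(sft_hessian(z)).min())
    print("ratio min eigenvalue:",
          np.linalg.eigvalsh(ratio_surrogate_hessian(z, 0, 1.0, 1 / 3)).min())
```

A negative minimum eigenvalue for the ratio term means that loss curves downward along some logit direction, so it is not convex in the logits at that point. Flipping the sign of `advantage` does not help here, because the Hessian of a softmax probability is indefinite.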



