Stabilizing On-Policy Distillation for MLLM Reasoning with Global Normalization

arXiv CS, Tuesday 09 June 2026, 04:00 UTC
By Dongze Hao, Zhiwei Jin, Chen Chen, Haonan Lu

Key Points

arXiv:2606.09091v1

Abstract: On-policy distillation (OPD) has recently emerged as an important post-training paradigm. By using a stronger teacher model to provide dense, fine-grained supervision for sampled trajectories, OPD offers a clear advantage over reinforcement learning with verifiable rewards (RLVR), which typically depends on sparse binary or outcome-based environmental feedback. However, naive token-level distillation can suffer from gradient instability due to magnitude misalignment in outlier states. To address this issue, we propose Globally Normalized Distillation Policy Optimization (GNDPO), a practical method that stabilizes optimization by transforming raw KL scores into batch-level relative advantages. This normalization effectively mitigates gradient explosions while retaining the benefits of token-level guidance. Experimental results show that GNDPO substantially improves training robustness and downstream performance across multimodal reasoning tasks. The code is released at https://github.com/OPPO-Mente-Lab/GNDPO.
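The abstract does not spell out how raw KL scores become batch-level relative advantages, so the following is only an illustrative sketch under stated assumptions, not the authors' implementation. It assumes the raw per-token score is something like a teacher-minus-student log-probability on the sampled token, and that "global normalization" means a z-score taken over every non-padding token in the batch. The function name `global_normalize`, its parameters, and the toy numbers are all hypothetical.

```python
import math

def global_normalize(scores, mask, eps=1e-6):
    """Z-score per-token scores over all unmasked tokens in the batch.

    ASSUMPTION: this is one plausible reading of "batch-level relative
    advantages"; the paper's actual formula may differ.

    scores: list of per-sequence lists of raw per-token scores.
    mask:   same shape, 1 for real tokens and 0 for padding.
    """
    valid = [s for row, mrow in zip(scores, mask)
             for s, m in zip(row, mrow) if m]
    if not valid:
        raise ValueError("batch contains no valid tokens")
    mean = sum(valid) / len(valid)
    var = sum((s - mean) ** 2 for s in valid) / len(valid)
    std = math.sqrt(var)
    # Padding positions get a zero advantage so they add no gradient.
    return [[(s - mean) / (std + eps) if m else 0.0
             for s, m in zip(row, mrow)]
            for row, mrow in zip(scores, mask)]

# Toy batch (made-up numbers): the second sequence has one outlier token.
raw = [[0.1, -0.2, 0.05, 0.0],
       [0.3, -50.0, 0.2, 0.0]]
mask = [[1, 1, 1, 0],
        [1, 1, 1, 1]]
adv = global_normalize(raw, mask)
for row in adv:
    print([round(a, 3) for a in row])
```

With a population z-score over N valid tokens, no normalized value can exceed sqrt(N - 1) in magnitude, so a single outlier score can no longer scale the update arbitrarily. That bound is a property of this sketch's normalization, not a claim about the paper's results.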


Originally published by arXiv CS.
