OGLS-SD: On-Policy Self-Distillation with Outcome-Guided Logit Steering for LLM Reasoning

arXiv CS Tuesday 02 June 2026, 04:00 UTC By Yuxiao Yang, Xiaoyun Wang, Weitong Zhang 1 min read

Key Points

Announce Type: replace Abstract: We study on-policy self-distillation (OPSD), where a language model improves its reasoning ability by distilling privileged teacher distributions along its own on-policy trajectories. Despite its promise, OPSD can suffer from training instability due to a pattern mismatch between teacher and student responses. Self-reflected teacher responses may introduce reflection-induced biases and response templates that miscalibrate token-level supervision, ultimately...

arXiv:2605.12400v2 Announce Type: replace Abstract: We study on-policy self-distillation (OPSD), where a language model improves its reasoning ability by distilling privileged teacher distributions along its own on-policy trajectories. Despite its promise, OPSD can suffer from training instability due to a pattern mismatch between teacher and student responses. Self-reflected teacher responses may introduce reflection-induced biases and response templates that miscalibrate token-level supervision, ultimately harming the student's reasoning ability. To mitigate this issue, we propose OGLS-SD, an outcome-guided logit-steering framework that leverages verifiable outcome rewards to calibrate privileged teacher logits. Specifically, OGLS-SD contrasts teacher logits induced by successful and failed on-policy trajectories, constructing an outcome-discriminative steering direction for token-level guidance. Experiments on mathematical reasoning benchmarks show that OGLS-SD stabilizes self-distillation and improves performance over standard OPSD and other variants.

OGLS-SD (ORG) OPSD (ORG)

Originally published by arXiv CS Read original →

OGLS-SD: On-Policy Self-Distillation with Outcome-Guided Logit Steering for LLM Reasoning

Related Stories

Five jailed for violence at Henry Nowak police protest

School uniform charity plans fundraising week

Students' data taken in major university cyber-attack

Students' data taken in major university cyber-attack