Monte Carlo KL
No mentions found
This entity hasn't been tracked yet, or Iris is still building its knowledge base.
Related Articles from SNS
OPRD: On-Policy Representation Distillation
Announce Type: replace Abstract: On-policy distillation (OPD) supervises the student only in output space by matching next-token probabilities. This output-only paradigm has two limits: (1) sampling variance from Monte Carlo KL estimates over large vocabularies (e.g., Qwen's ~150k tokens) persists throughout training, and (2) it treats the teacher as a black-box, discarding all intermediate hidden states after the LM head. We propose On-Policy Representation Distillation (OPRD), which lifts...
OPRD: On-Policy Representation Distillation
arXiv:2606.06021v1 Announce Type: new Abstract: On-policy distillation (OPD) supervises the student only in output space by matching next-token probabilities. This output-only paradigm has two limits: (1) sampling variance from Monte Carlo KL estimates over large vocabularies (e.g., Qwen's ~150k tokens) persists throughout training, and (2) it treats the teacher as a black-box, discarding all intermediate hidden states after the LM head. We propose On-Policy Representation Distillation...
Local and Global Contraction Principles for MCMC Mixing
arXiv:2606.03033v1 Announce Type: new Abstract: We develop a contraction-based framework for proving mixing-time bounds for Markov chain Monte Carlo algorithms. The framework is built around global and local contraction coefficients of Markov kernels under the $\mathsf E_\gamma$-divergence with $\gamma\ge1$. For projected Langevin Monte Carlo on a compact convex domain, we show that Gaussian smoothing yields an explicit global contraction coefficient for the $\mathsf E_\gamma$-divergence....
OmniOPD: Logit-Free On-Policy Distillation via Speculative Verification
arXiv:2606.01476v1 Announce Type: new Abstract: On-Policy Distillation (OPD) trains a student model on its own generative trajectories under dense token-level feedback from a stronger teacher, mitigating both the off-policy distribution shift of Supervised Fine-Tuning (SFT) and the sparse credit assignment of Reinforcement Learning (RL). However, standard OPD faces two coupled limitations. First, it requires direct access to the teacher's token-level logits, excluding a broad class of...