Progressive OPD

No mentions found

This entity hasn't been tracked yet, or Iris is still building its knowledge base.

Related Articles from SNS

Cornerstones or Stumbling Blocks? Deciphering the Rock Tokens in On-Policy Distillation

arXiv:2605.09253v2 (announce type: replace). Abstract: While recent work in Reinforcement Learning with Verifiable Rewards (RLVR) has shown that a small subset of critical tokens disproportionately drives reasoning gains, an analogous token-level understanding of On-Policy Distillation (OPD) remains largely unexplored. In this work, we investigate high-loss tokens, a token type that, as the most direct signal of student-teacher mismatch under OPD's per-token KL objective, should progressively...

arXiv CS 8d ago

Are Full Rollouts Necessary for On-Policy Distillation?

arXiv:2605.31490v2 (announce type: replace). Abstract: On-policy distillation (OPD) provides dense teacher feedback along student-generated rollouts rather than fixed teacher traces and has emerged as a promising post-training paradigm. However, standard OPD typically generates full rollouts during training, which is computationally expensive and may expose the student to unreliable teacher feedback at late rollout positions, especially during early training. We identify the rollout horizon as...

arXiv CS 8d ago

Are Full Rollouts Necessary for On-Policy Distillation?

(Announce type: new). Abstract: On-policy distillation (OPD) provides dense teacher feedback along rollouts generated by the student and has emerged as a promising post-training paradigm for long-horizon reasoning. However, standard OPD typically generates full rollouts during training, which is computationally expensive and may expose the student to unreliable teacher feedback at late rollout positions, especially during early training. We identify the rollout horizon as a key bottleneck in...

arXiv CS 9d ago