the Step-Wise Refusal Internal Dynamics (SRI
Related Articles from SNS
Step-Wise Refusal Dynamics in Autoregressive and Diffusion Language Models
arXiv:2602.02600v3
Abstract: Diffusion language models (DLMs) have recently emerged as a competitive alternative to autoregressive (AR) models, offering parallel decoding, competitive generation quality, and initial evidence of improved jailbreak robustness. Despite this progress, the role of sampling mechanisms in shaping refusal behavior remains poorly understood. To address this gap, we present a comprehensive study of step-wise refusal dynamics.