Step-Wise Refusal Internal Dynamics (SRI)

Related Articles from SNS

Step-Wise Refusal Dynamics in Autoregressive and Diffusion Language Models

arXiv:2602.02600v3

Abstract: Diffusion language models (DLMs) have recently emerged as a competitive alternative to autoregressive (AR) models, offering parallel decoding, competitive generation quality, and initial evidence of improved jailbreak robustness. Despite this progress, the role of sampling mechanisms in shaping refusal behavior remains poorly understood. To address this gap, we present a comprehensive study of step-wise refusal dynamics.

arXiv CS 2d ago