Measuring Alignment-Induced Activation Shifts Correctly: A Template-Controlled Difference-in-Differences Protocol

arXiv CS, Tuesday 02 June 2026, 04:00 UTC. By Yuki Nakamura.

Key Points

arXiv:2605.24583v3 (replacement). Abstract: Comparing a model's internal activations before and after alignment is a natural way to ask what safety training changes: one forms the matrix of paired aligned-minus-base activations on safety-relevant inputs and reads off its effective rank or top direction. We show that the obvious way to form this matrix is confounded. The aligned model is evaluated under a chat template the base model never saw, so the naive difference conflates the alignment shift with the effect of chat formatting. We introduce a four-variant decomposition of the modification matrix (naive, template-controlled, within-aligned, and difference-in-differences, DiD) that separates the two effects. Template control alone removes a 2.0-3.9x inflation of the measured effective rank across Llama-3.1-8B, Gemma-2-9B, and Qwen-2.5-7B; the DiD contrast is what recovers the refusal direction of Arditi et al. (2024), lifting its cosine alignment from 0.18-0.39 to 0.50-0.86. Projection-ablation across the three families confirms that the recovered subspace is behaviorally active and that singular-value order is not causal order. We validate the protocol on a controlled testbed and distill it into measurement recommendations for activation-difference studies of alignment.
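
The quantities named in the abstract can be illustrated with a minimal NumPy sketch on synthetic data. Everything in it is an assumption made for illustration, not the paper's method. The abstract does not define the four variants, so the raw/chat contrasts in `modification_matrices` are one plausible reading. The entropy-based `effective_rank` is one common definition among several. The toy data is built so that a single direction `r` appears only when alignment and chat template are both present. All function and variable names are hypothetical.

```python
import numpy as np

def effective_rank(M):
    # Assumption: entropy-based effective rank, exp(H(p)) with p the
    # normalized singular values. The abstract does not say which
    # definition the paper uses.
    s = np.linalg.svd(M, compute_uv=False)
    p = s / s.sum()
    p = p[p > 0]
    return float(np.exp(-(p * np.log(p)).sum()))

def top_direction(M):
    # Top right singular vector: the dominant direction in activation space.
    _, _, vt = np.linalg.svd(M, full_matrices=False)
    return vt[0]

def abs_cosine(u, v):
    # Sign-invariant cosine, since singular vectors have arbitrary sign.
    return float(abs(u @ v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def project_out(H, v):
    # Projection-ablation: remove the component of each row along v.
    v = v / np.linalg.norm(v)
    return H - np.outer(H @ v, v)

def modification_matrices(base_raw, base_chat, aligned_raw, aligned_chat):
    # Assumed reading of the four variants (not confirmed by the abstract).
    return {
        "naive": aligned_chat - base_raw,
        "template_controlled": aligned_chat - base_chat,
        "within_aligned": aligned_chat - aligned_raw,
        "did": (aligned_chat - aligned_raw) - (base_chat - base_raw),
    }

# Synthetic testbed (hypothetical): n prompts, d-dimensional activations.
rng = np.random.default_rng(0)
n, d = 64, 32
r = rng.normal(size=d)
r /= np.linalg.norm(r)                                   # planted direction
X = rng.normal(size=(n, d))                              # shared content
T = rng.normal(size=(n, 4)) @ rng.normal(size=(4, d))    # template effect
S = 0.5 * rng.normal(size=(n, 2)) @ rng.normal(size=(2, d))  # template-independent shift
c = rng.uniform(1.0, 2.0, size=(n, 1))                   # per-prompt strength

mats = modification_matrices(
    base_raw=X,
    base_chat=X + T,
    aligned_raw=X + S,
    aligned_chat=X + T + S + c * r,
)

for name, M in mats.items():
    print(name, round(effective_rank(M), 2), round(abs_cosine(top_direction(M), r), 2))
```

In this toy construction the naive difference mixes the template effect with the planted direction, while the DiD contrast cancels both additive terms and leaves only the planted direction. Real activations would not decompose this cleanly, so the sketch shows only the bookkeeping, not the reported results.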

Originally published by arXiv CS.
