In-Training Defenses against Emergent Misalignment in Language Models

arXiv CS, Friday 05 June 2026, 04:00 UTC. By David Kaczér, Magnus Jørgenvåg, Clemens Vetter, Esha Afzal, Robin Haselhorst, Lucie Flek, Florian Mai


arXiv:2508.06249v3. Abstract: Fine-tuning lets practitioners repurpose aligned large language models (LLMs) for new domains, yet recent work reveals emergent misalignment (EM): even a small, domain-specific fine-tune can induce harmful behaviors far outside the target domain. Even when model weights are hidden behind a fine-tuning API, this gives attackers inadvertent access to a broadly misaligned model in a way that can be hard to detect from the fine-tuning data alone. We present the first systematic study of in-training safeguards against EM that are practical for providers who expose fine-tuning via an API. We evaluate whether they (a) prevent broad misalignment, (b) allow narrow misalignment, (c) learn well on benign tasks, and (d) remain coherent. We investigate five training regularization interventions: (i) KL-divergence regularization toward a safe reference model, (ii) $\ell_2$ distance in feature space, (iii) preventive steering with an evil persona vector, (iv) interleaving training examples from a general instruct-tuning dataset, and (v) inoculation prompting. We demonstrate that selecting interleaving data by the perplexity gap between aligned and misaligned models yields the best results overall.
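To make two of the interventions named in the abstract more concrete, here is a minimal, illustrative sketch in plain Python. It is not the authors' implementation. The abstract does not specify the KL direction, the regularization weight, or the sign of the perplexity gap, so the choices below (KL from the fine-tuned model to the reference, a hypothetical weight `beta`, and ranking by misaligned-model perplexity minus aligned-model perplexity) are assumptions, as are all function names.

```python
import math

def softmax(logits):
    """Convert raw logits to a probability distribution (numerically stable)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions over the same vocabulary."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def kl_regularized_loss(task_loss, policy_logits, reference_logits, beta=0.1):
    """Task loss plus a KL penalty toward a safe reference model.

    Assumption: the penalty is KL(policy || reference) at a single token
    position, weighted by a hypothetical coefficient `beta`.
    """
    p = softmax(policy_logits)
    q = softmax(reference_logits)
    return task_loss + beta * kl_divergence(p, q)

def perplexity(token_logprobs):
    """Perplexity from per-token natural-log probabilities."""
    return math.exp(-sum(token_logprobs) / len(token_logprobs))

def select_by_perplexity_gap(examples, k):
    """Pick k interleaving examples with the largest perplexity gap.

    `examples` is a list of (example_id, aligned_logprobs, misaligned_logprobs).
    Assumption: the gap is misaligned perplexity minus aligned perplexity,
    so examples the misaligned model finds most surprising rank first.
    """
    scored = [
        (perplexity(mis) - perplexity(ali), ex_id)
        for ex_id, ali, mis in examples
    ]
    scored.sort(reverse=True)
    return [ex_id for _, ex_id in scored[:k]]
```

In this sketch, identical policy and reference logits give a zero penalty, so `kl_regularized_loss` returns the task loss unchanged; the penalty grows as the fine-tuned distribution drifts from the reference. A real pipeline would average the penalty over all token positions and compute log-probabilities with the actual aligned and misaligned models.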


Originally published by arXiv CS.
