Efficient and Stealthy Jailbreak Attacks via Adversarial Prompt Distillation from LLMs to SLMs

arXiv CS Tuesday 09 June 2026, 04:00 UTC By Xiang Li, Chong Zhang, Jia Wang, Fangyu Wu, Yushi Li, Xiaobo Jin 1 min read

Key Points

Announce Type: replace Abstract: Current jailbreak attacks on large language models (LLMs) predominantly rely on LLMs themselves to generate adversarial prompts, creating a critical efficiency bottleneck: each attack requires substantial computational resources and API queries, limiting scalability and practical deployment. To overcome this limitation, we propose Adversarial Prompt Distillation (APD), a novel framework that transfers jailbreaking capabilities from LLMs to small language...

arXiv:2506.17231v3 Announce Type: replace Abstract: Current jailbreak attacks on large language models (LLMs) predominantly rely on LLMs themselves to generate adversarial prompts, creating a critical efficiency bottleneck: each attack requires substantial computational resources and API queries, limiting scalability and practical deployment. To overcome this limitation, we propose Adversarial Prompt Distillation (APD), a novel framework that transfers jailbreaking capabilities from LLMs to small language models (SLMs) for efficient, low-resource attacks. APD integrates three key components: (1) masked adversarial knowledge pre-training via LoRA fine-tuning, (2) dynamic temperature-controlled knowledge distillation to bridge architectural gaps, and (3) reinforcement learning-based template optimization for adaptive refinement. Extensive experiments across 12 models show that APD achieves state-of-the-art attack success rates (e.g., 96.4% ASR_k on GPT-4) while dramatically improving efficiency - generating prompts 3.7x faster with 11.3x fewer parameters than teacher models. Our work establishes the first practical framework for lightweight jailbreak attacks, exposes new vulnerabilities in LLM defenses, and provides a scalable testbed for advancing AI safety research. Our code is available at: https://github.com/lxgem/Efficient_and_Stealthy_Jailbreak_Attacks_via_Adversarial_Prompt.

Adversarial Prompt Distillation (ORG) API (ORG) APD (ORG) LLM (ORG)

Originally published by arXiv CS Read original →

Efficient and Stealthy Jailbreak Attacks via Adversarial Prompt Distillation from LLMs to SLMs

Related Stories

Whale graveyard dating back five million years discovered

SpaceX Leaves Some Banks Peeved at Junior Roles in IPO Lineup

'Worrying' pollution in Cotswolds river - volunteers

Nasa chief defends choice of all-male Artemis III crew