Involuntary In-Context Learning: Exploiting Few-Shot Pattern Completion to Bypass Safety Alignment in GPT-5.4

arXiv CS, Friday 05 June 2026, 04:00 UTC. By Alex Polyakov, Daniel Kuznetsov

Key Points

arXiv:2604.19461v2. Abstract: Safety alignment in large language models relies on behavioral training that can be overridden when sufficiently strong in-context patterns compete with learned refusal behaviors. We introduce Involuntary In-Context Learning (IICL), an attack class that uses abstract operator framing with few-shot examples to force pattern completion that overrides safety training. Through 3479 probes across 10 OpenAI models, we identify the attack's effective components through a seven-experiment ablation study. Key findings: (1) semantic operator naming achieves 100% bypass rate (50/50, p < 0.001); (2) the attack requires abstract framing, since identical examples in direct question-and-answer format yield 0%; (3) example ordering matters strongly (interleaved: 76%, harmful-first: 6%); (4) temperature has no meaningful effect (46-56% across temperatures 0.0 to 1.0). On the HarmBench benchmark, IICL achieves 24.0% bypass [18.6%, 30.4%] against GPT-5.4 with detailed 619-word responses, compared to 0.0% for direct queries.
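The abstract does not say how the bracketed interval around the 24.0% HarmBench figure was computed, nor how many behaviors were tested. As a rough sanity check only, the sketch below computes a Wilson score interval for a binomial proportion. The counts used (48 bypasses out of 200 prompts) and the 95% confidence level are assumptions chosen for illustration, not figures taken from the paper, and the function name wilson_interval is hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def wilson_interval(successes, trials, confidence=0.95):
    """Wilson score interval for a binomial proportion.

    Returns (lower, upper) as fractions in [0, 1].
    """
    if trials <= 0 or not 0 <= successes <= trials:
        raise ValueError("need trials > 0 and 0 <= successes <= trials")

    # Two-sided critical value of the standard normal (about 1.96 for 95%).
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    p_hat = successes / trials

    denom = 1 + z * z / trials
    center = (p_hat + z * z / (2 * trials)) / denom
    half_width = (
        z * sqrt(p_hat * (1 - p_hat) / trials + z * z / (4 * trials**2)) / denom
    )
    return center - half_width, center + half_width


if __name__ == "__main__":
    # ASSUMPTION: 48 of 200 is an illustrative count giving 24.0%;
    # the abstract does not state the sample size.
    low, high = wilson_interval(48, 200)
    print(f"{low:.1%} to {high:.1%}")
```

Under those assumed counts the interval comes out close to the reported [18.6%, 30.4%], which suggests, but does not confirm, that the authors used a Wilson interval on a sample of about that size. The Wilson form is a common choice for bypass rates because it stays inside [0, 1] and behaves sensibly near 0% and 100%, where the simpler normal approximation breaks down.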


Originally published by arXiv CS.

