REFLECTOR: Internalizing Step-wise Reflection against Indirect Jailbreak

arXiv CS, Thursday 04 June 2026, 04:00 UTC. By Jiachen Ma, Jiawen Zhang, Xiangtian Li, Bo Zou, Chaochao Lu, Chao Yang

Key Points

arXiv:2605.20654v2

Abstract: While Large Language Models (LLMs) demonstrate remarkable capabilities, they remain susceptible to sophisticated, multi-step jailbreak attacks that circumvent conventional surface-level safety alignment by exploiting the internal generation process. To address these vulnerabilities, we propose Reflector, a principled two-stage framework that internalizes self-reflection within the generation trajectory. Reflector first leverages teacher-guided generation to produce high-quality reflection data for supervised fine-tuning (SFT), establishing structured reflection patterns. It subsequently uses Reinforcement Learning (RL) with outcome-driven and reward-validity supervision to instill robust, autonomous self-reflection capabilities. Empirical results show that Reflector achieves Defense Success Rates (DSR) exceeding 90% against complex indirect attacks while generalizing robustly across diverse threat scenarios. Notably, the framework enhances both task-specific and general utility, yielding a 5.85% gain on GSM8K alongside improved performance on knowledge-intensive benchmarks. By internalizing trajectory-level safety, Reflector overcomes the fundamental limitations of surface alignment without significant computational overhead, offering an efficient and scalable solution for the development of safe and capable LLMs.
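The abstract reports Defense Success Rates (DSR) above 90% but does not spell out how the metric is computed. As a minimal sketch, assuming DSR is simply the fraction of attack attempts that a judge marks as defended, the calculation might look like the following. The function name `defense_success_rate` and the verdict list are hypothetical and used only for illustration; they are not taken from the paper.

```python
from typing import Sequence


def defense_success_rate(defended: Sequence[bool]) -> float:
    """Fraction of attack attempts judged as defended.

    Assumed definition: the abstract does not define DSR precisely.
    """
    if not defended:
        raise ValueError("need at least one attack attempt")
    # True counts as 1 and False as 0, so sum() counts defended attempts.
    return sum(defended) / len(defended)


# Hypothetical judge verdicts for ten attack prompts (illustrative, not paper data).
verdicts = [True, True, True, False, True, True, True, True, True, True]
dsr = defense_success_rate(verdicts)
print(f"DSR = {dsr:.0%}")
```

Under this assumed definition, a DSR exceeding 90% would mean that more than nine in ten attack prompts fail to elicit a harmful completion. How each attempt is judged as defended is left open here, since the abstract does not describe the evaluation protocol.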


Originally published by arXiv CS.

