Home Knowledge Base Reflector achieves Defense Success

Reflector achieves Defense Success

No mentions found

This entity hasn't been tracked yet, or Iris is still building its knowledge base.

Related Articles from SNS

REFLECTOR: Internalizing Step-wise Reflection against Indirect Jailbreak

arXiv:2605.20654v2 Announce Type: replace Abstract: While Large Language Models (LLMs) demonstrate remarkable capabilities, they remain susceptible to sophisticated, multi-step jailbreak attacks that circumvent conventional surface-level safety alignment by exploiting the internal generation process. To address these vulnerabilities, we propose Reflector, a principled two-stage framework that internalizes self-reflection within the generation trajectory. Reflector first leverages...

arXiv CS 6d ago