Home Knowledge Base LLM Defender

LLM Defender

No mentions found

This entity hasn't been tracked yet, or Iris is still building its knowledge base.

Related Articles from SNS

Strengthening Polymorphic Prompt Assembling: Dynamic Separator Generation Against Emerging Prompt Injection Attacks

arXiv:2605.30534v1 Announce Type: new Abstract: Polymorphic Prompt Assembling (PPA) defends LLM agents against prompt injections by randomly selecting separator pairs from a fixed pool to isolate user input from system instructions. Although effective, static pool reuse exposes a blast-radius vulnerability: once a separator leaks, it can be exploited in future requests. We propose a dynamic per-request separator generation using domain-separated SHA-256 digests keyed on the timestamp,...

arXiv CS 9d ago

From Prompt Injection to Persistent Control: Defending Agentic Harness Against Trojan Backdoors

Announce Type: new Abstract: LLM agents are evolving from conversational chatbots to operational tools in real-world workspaces. In local agentic harnesses, an LLM can read and write files, call tools, and reuse workspace state across sessions. While such capabilities enhance utility, they also expose a new attack surface for attackers.

arXiv CS 9d ago

ZERO-APT: A Closed-Loop Adversarial Framework for LLM-Driven Automated Penetration Testing under Intelligent Defense

arXiv:2606.05567v1 Announce Type: new Abstract: LLM-driven automated penetration testing agents are typically evaluated against static targets that neither detect nor respond to attacks, so their behavior under intelligent defense remains untested. The causal consistency of multi-step attack chains likewise hinges on unstable LLM reasoning, and agent decisions remain opaque to human analysts. These three shortcomings, in realism, consistency, and auditability, are usually patched in isolation.

arXiv CS 5d ago

Backdoor Unlearning Generalization: A Path Toward the Removal of Unknown Triggers in LLMs

Announce Type: new Abstract: Backdoor attacks in Large Language Models (LLMs) are a growing security concern, where models can generate adversary-chosen content. Existing defenses target backdoors one at a time and typically require knowledge of the trigger, leaving the defender at a structural disadvantage when unknown backdoors may exist in a model. We show that backdoor neutralization through unlearning generalizes across backdoors: training a model to ignore a single trigger can also...

arXiv CS 7d ago

Backdoor Unlearning Generalization: A Path Toward the Removal of Unknown Triggers in LLMs

arXiv:2606.03785v2 Announce Type: replace Abstract: Backdoor attacks in Large Language Models (LLMs) are a growing security concern, where models can generate adversary-chosen content. Existing defenses target backdoors one at a time and typically require knowledge of the trigger, leaving the defender at a structural disadvantage when unknown backdoors may exist in a model. We show that backdoor neutralization through unlearning generalizes across backdoors: training a model to ignore a...

arXiv CS 5d ago

Context-Fractured Decomposition Attacks on Tool-Using LLM Agents: Exploiting Artifact Provenance Gaps

arXiv:2606.09084v1 Announce Type: new Abstract: Tool-using LLM agents interact with the world through actions that persist state in artifacts (e.g., workspace files or logs). Consequently, jailbreak defenses must reason about cross-step composition rather than isolated text. Yet most existing attacks and defenses, including ``multi-turn'' jailbreaks such as Crescendo and Tree of Attacks,still assume a single contiguous conversation visible to the defender.

arXiv CS 1d ago

Chuck, Wilson and the emergence of artificial minds in human-AI conversations

arXiv:2601.13081v2 Announce Type: replace Abstract: Large Language Models (LLMs) can simulate person-like things which at least appear to have stable behavioural and psychological dispositions. Call these things characters. Are characters minded and psychologically continuous entities with mental states like beliefs, desires and intentions?

arXiv CS 6d ago

Nobody needs Mythos or 0-days to build a chaos-causing computer worm – free open source models work just fine

There's a lot of fear surrounding the bug-finding capabilities of super-advanced AI models like Anthropic's Mythos and OpenAI's GPT 5.5-Cyber. But attackers are already using free, publicly available LLMs to hijack networks and worm through software supply chains at a much lower cost – to them at least. The latest example comes from University of Toronto researchers, who used an unnamed, publicly available open-weight model released in 2025 to develop a computer worm that they claim spread...

The Register 6d ago

Closing the Sim-to-Real Gap: An Evaluation Framework for Autonomous Cyber Defense Configuration of Commercial EDR

new Abstract: Leading commercial endpoint detection and response (EDR) products have shifted from operator-configured rule sets to multi-component systems where autonomous AI components operate alongside, and increasingly in place of, operator-deployed policies. Autonomous defense agents using commercial EDR as their hardening tool are no longer tuning a passive tool, but a black-box autonomous system capable of making vendor-specific decisions. We present the first evaluation framework for...

arXiv CS 1d ago

CHASE: Adversarial Red-Blue Teaming for Improving LLM Safety using Reinforcement Learning

arXiv:2606.05523v1 Announce Type: new Abstract: Despite advances in safety alignment, prompt-rewriting attacks such as persona modulation, fictional framing and persuasion-based reformulation, can bypass safety filters even on frontier models. Existing defenses either rely on non-scalable human curation or white-box optimisation that overfits to specific model internals, leaving aligned models brittle against the very class of adaptive black-box adversaries they will face in deployment. To...

arXiv CS 5d ago