Home Knowledge Base Attack Success

Attack Success

No mentions found

This entity hasn't been tracked yet, or Iris is still building its knowledge base.

Related Articles from SNS

POISE: Position-Aware Undetectable Skill Injection on LLM Agents

arXiv:2606.07943v1 Announce Type: new Abstract: Agent skills provide a lightweight mechanism for extending general-purpose agents, but their open format exposes them to skill-poisoning attacks. A practically dangerous injection must stay invisible: if executing the payload derails the user's legitimate task, the resulting failure signal invites inspection of the skill. We therefore evaluate attacks by Attack Success Rate, which requires the injected payload to execute and the user's task to...

arXiv CS 1d ago

EVA: Evolving Semantic Adversaries for Red-Teaming GUI Agents Against Environmental Injection Attacks

arXiv:2505.14289v2 Announce Type: replace Abstract: Graphical User Interface (GUI) agents powered by Multimodal Large Language Models (MLLMs) are increasingly deployed yet vulnerable to Environmental Injection Attacks (EIAs).However, current red-teaming methods are hindered by prohibitive computational costs and limited adaptability. A fundamental question remains unaddressed: does the bottleneck of attack success lie in visual perception or semantic understanding? Through controlled...

arXiv CS 2d ago

Domain-Conditioned Safety in Frontier Computer-Using Agents: A 793-Episode Browser Benchmark, a Coding-Domain Cross-Reference, and a Reproducibility Audit of Recent Red-Teaming

arXiv:2606.05233v1 Announce Type: new Abstract: Recent computer-using-agent (CUA) red-teaming papers report prompt-injection attack success rates (ASR) of 42-98%, but these headline numbers cluster on retired models and on the most-vulnerable model in each paper's panel. We ask whether those techniques, reproduced as hand-crafted templates, still work against current frontier CUAs.

arXiv CS 5d ago

AtomEval: Validity-Aware Atomic Evaluation of Adversarial Claim Rewriting in Fact Verification

arXiv:2604.07967v3 Announce Type: replace Abstract: Large language models (LLMs) can rewrite refuted claims to evade evidence-based fact verifiers, but conventional attack success rate (ASR) can be inflated when rewrites change, weaken, or correct the false proposition they are supposed to preserve. We introduce AtomEval, a validity-aware evaluation protocol for fixed-evidence adversarial claim rewriting. AtomEval represents claims as subject--relation--object--modifier (SROM) atoms, applies...

arXiv CS 8d ago

Self-Mined Hardness for Safety Fine-Tuning

arXiv:2605.03226v2 Announce Type: replace Abstract: Safety fine-tuning of language models typically requires a curated adversarial dataset. We take a different approach: score each candidate prompt's difficulty by how often the target model's own rollouts are judged harmful, then fine-tune on the hardest prompts paired with the model's own non-jailbroken rollouts. On Llama-3-8B-Instruct and Llama-3.2-3B-Instruct, this approach cuts the WildJailbreak attack success rate from 11.5% and 20.1%...

arXiv CS 1d ago

Anthropic expands access to Mythos despite warnings of mass cyberattacks

Anthropic is expanding access to Mythos, its powerful cybersecurity model, to 150 new partners across more than 15 countries even as the company warns that a successful attack on any of their systems could affect more than 100 million people. Anthropic is expanding testing of Mythos, its latest cybersecurity model, by bringing in around 150 additional partners. The company, seen as the main rival to OpenAI’s ChatGPT, announced in April that it was launching Mythos, a model it said “has...

Euronews 7d ago

Beyond Pass/Fail: Using Process Mining to Understand How LLMs Resist (and Fail) Red Team Attacks

Announce Type: new Abstract: Standard AI red teaming evaluations reduce adversarial campaigns to a single binary outcome, attack success rate (ASR), not taking into account the sequential structure of how models resist or yield to attacks. We propose applying process mining, a discipline for discovering and analyzing process models from event logs, to red teaming traces. We conduct a controlled experiment pitting 60 HarmBench prompts against two LLMs, GPT-OSS 120B and Llama 3.3 70B, using 10...

arXiv CS 1d ago

Characterizing Detectability in 3DGS Poisoning: A Stage-wise Benchmark

arXiv:2606.03499v1 Announce Type: new Abstract: 3D Gaussian Splatting (3DGS) has rapidly emerged as a leading representation for real-time novel view synthesis, but recent work shows it is vulnerable to diverse poisoning attacks, including illusory object injection, computation cost amplification, and post hoc model watermarking. Despite this expanding threat surface, existing studies focus mainly on attack success, while defense and detection remain underexplored. From a detection...

arXiv CS 7d ago

Depth-Dependent Indirect Prompt Injection in Tool-Calling ReAct Agents: Injection Depth, Payload Framing, and Turn-Budget Sensitivity

Announce Type: new Abstract: ReAct agents that interleave chain-of-thought reasoning with tool calls are increasingly deployed for real tasks such as scheduling, file retrieval, and data access. Their tool observation loop creates a direct attack surface: an adversary who controls any tool's return value can embed instructions that redirect the agent away from the user's goal, a threat known as indirect prompt injection. Existing benchmarks evaluate attack success rate (ASR) at a fixed...

arXiv CS 9d ago

The Surface You Test Is Not the Surface That Breaks

Announce Type: new Abstract: Tool-augmented LLM agents are vulnerable to prompt injection: a third party who controls part of the agent's context can plant instructions that the agent then executes as if they came from the user. Current evaluations report a single attack success rate per model on one channel, the tool output and treat that number as the model's vulnerability. But tool descriptions, which the agent reads at every turn before any tool is called, are themselves an injection...

arXiv CS 9d ago