GPT-OSS-Safeguard

No mentions found

This entity hasn't been tracked yet, or Iris is still building its knowledge base.

Related Articles from SNS

Claudini: Autoresearch Discovers State-of-the-Art Adversarial Attack Algorithms for LLMs

arXiv:2603.24511v2 Announce Type: replace Abstract: We show that AI agents are capable of discovering novel algorithms for adversarial attacks against LLMs, advancing the state of the art on white-box jailbreaking and prompt injection evaluations. We deploy frontier agents, such as Claude Code and Codex, in an autoresearch loop with access to a library of 30+ prior methods and an evaluation script with a fixed compute budget. We show this pipeline to be effective in jailbreaking OpenAI's...

arXiv CS 8d ago

Steering LLM Viewpoints through Fabricated Evidence Injection

arXiv:2606.06244v1 Announce Type: new Abstract: As chatbots increasingly influence daily decision-making, their potential to produce misleading responses poses substantial risks to users. This paper investigates a critical cognitive vulnerability in LLMs: their tendency to uncritically trust external context when presented with fabricated evidence bearing markers of credibility. We introduce Ghostwriter, a two-phase attack framework that first repackages misleading statements with fabricated...

arXiv CS 5d ago

Sovereign News Station

Self-hosted. No tracking. No ads. Independent news intelligence powered by sovereign infrastructure.

Daily briefing to your inbox:

Subscribed. Welcome aboard.

Home Live Analysis Trending Analytics Operations RSS Feed About

Sovereign News Station — Independent news intelligence · Self-hosted · No tracking