Home Knowledge Base ToxiGen

ToxiGen

No mentions found

This entity hasn't been tracked yet, or Iris is still building its knowledge base.

Related Articles from SNS

The Refusal--Compliance Tradeoff: A Large-Scale Safety Behavior Audit of Large Language Models

arXiv:2605.05427v2 Announce Type: replace Abstract: Refusal rates are a poor proxy for LLM safety, i.e., a model may over-refuse benign prompts while still complying with harmful ones. We audit both failure modes across 21 open-weight LLMs on four safety benchmarks (OR-Bench, XSTest, ToxiGen, BOLD), using a composition adjustment to isolate model sensitivity from dataset toxicity confounds. We report three findings.

arXiv CS 8d ago

Steering LLM Viewpoints through Fabricated Evidence Injection

arXiv:2606.06244v1 Announce Type: new Abstract: As chatbots increasingly influence daily decision-making, their potential to produce misleading responses poses substantial risks to users. This paper investigates a critical cognitive vulnerability in LLMs: their tendency to uncritically trust external context when presented with fabricated evidence bearing markers of credibility. We introduce Ghostwriter, a two-phase attack framework that first repackages misleading statements with fabricated...

arXiv CS 5d ago