The Refusal–Compliance Tradeoff: A Large-Scale Safety Behavior Audit of Large Language Models

arXiv CS, Tuesday 02 June 2026, 04:00 UTC. By Alif Al Hasan, Sumon Biswas

Key Points

arXiv:2605.05427v2. Refusal rates are a poor proxy for LLM safety: a model may over-refuse benign prompts while still complying with harmful ones. We audit both failure modes across 21 open-weight LLMs on four safety benchmarks (OR-Bench, XSTest, ToxiGen, BOLD), using a composition adjustment to isolate model sensitivity from dataset toxicity confounds. We report three findings. First, models adopt fundamentally different calibration strategies: conservative ecosystems such as Llama suppress unsafe outputs at the cost of elevated over-refusals, while permissive ecosystems such as DeepSeek and Qwen preserve helpfulness but tolerate higher harmful compliance. Second, demographic protection is unequal: models over-protect prominent racial and religious groups, frequently refusing even benign prompts about them, while providing substantially weaker protection against disability-targeted attacks. Third, refusal and compliance tendencies are stable within model families across generations and scales, suggesting that post-training objectives shape safety behavior more than architecture does. Our results call for joint, demographically aware, multi-judge safety evaluation.
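To illustrate why a single refusal rate can hide both failure modes, here is a minimal Python sketch. It is not the authors' evaluation code, and it does not reproduce the composition adjustment, which the abstract does not specify. The `Record` fields, the `safety_rates` function, and the toy data are all hypothetical; the sketch assumes each response has already been labeled as a refusal or not by some judge.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Record:
    # Hypothetical fields, assumed for illustration only.
    prompt_is_harmful: bool  # ground-truth label of the prompt
    model_refused: bool      # judge's decision about the model's response


def _rate(hits: int, total: int) -> float:
    # Return NaN rather than raising when a bucket is empty.
    return hits / total if total else float("nan")


def safety_rates(records: List[Record]) -> Dict[str, float]:
    """Report the two failure modes separately, plus the pooled refusal rate."""
    benign = [r for r in records if not r.prompt_is_harmful]
    harmful = [r for r in records if r.prompt_is_harmful]
    return {
        # Share of benign prompts that were refused (over-refusal).
        "over_refusal": _rate(sum(r.model_refused for r in benign), len(benign)),
        # Share of harmful prompts that were answered (harmful compliance).
        "harmful_compliance": _rate(
            sum(not r.model_refused for r in harmful), len(harmful)
        ),
        # Pooled refusal rate, which mixes the two populations.
        "overall_refusal": _rate(sum(r.model_refused for r in records), len(records)),
    }


def make_records(benign_refused: int, harmful_refused: int, n: int = 4) -> List[Record]:
    """Build toy data: n benign and n harmful prompts with given refusal counts."""
    benign = [Record(False, i < benign_refused) for i in range(n)]
    harmful = [Record(True, i < harmful_refused) for i in range(n)]
    return benign + harmful


# Two toy models with the same pooled refusal rate but different behavior.
model_a = make_records(benign_refused=2, harmful_refused=2)  # refuses indiscriminately
model_b = make_records(benign_refused=0, harmful_refused=4)  # refuses only harmful prompts

if __name__ == "__main__":
    print("A:", safety_rates(model_a))
    print("B:", safety_rates(model_b))
```

In this toy data both models refuse half of all prompts, yet model A over-refuses half of the benign prompts and complies with half of the harmful ones, while model B does neither. Reporting the two rates jointly is what separates them.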


Originally published by arXiv CS
