Reliable to Expressive: A Curriculum for Rubric-Following Safety Judges

arXiv CS Tuesday 09 June 2026, 04:00 UTC By Yongtaek Lim, Hyeji Choi, Minwoo Kim 1 min read

Key Points

Announce Type: new Abstract: Safety judges are increasingly deployed to evaluate model outputs against evolving criteria, yet recent meta-evaluation work shows they remain brittle under prompt and rubric variation, with false negative-rate swings of up to 0.24 reported for stylistic perturbations alone. We argue that safety judgment is fundamentally a rubric-following problem: a robust judge must apply the given evaluation criteria consistently across rubric formulations rather than memorize...

arXiv:2606.09165v1 Announce Type: new Abstract: Safety judges are increasingly deployed to evaluate model outputs against evolving criteria, yet recent meta-evaluation work shows they remain brittle under prompt and rubric variation, with false negative-rate swings of up to 0.24 reported for stylistic perturbations alone. We argue that safety judgment is fundamentally a rubric-following problem: a robust judge must apply the given evaluation criteria consistently across rubric formulations rather than memorize one specific template. We propose a training strategy that combines (i) instance-conditioned dynamic rubrics generated from prompt-response-label triples to expose the judge to the variability of evaluation criteria, and (ii) a reliable-to-expressive curriculum that begins with clean fixed-rubric supervision and progressively introduces noisier dynamic-rubric data. We evaluate on a single human-labeled set under three contrasting rubric prompts (HarmBench-style, ShieldGemma-style, and a domain-specific rubric). Our 12B curriculum judge achieves 94.12-94.88% accuracy across the three rubrics with a cross-rubric range of only 0.76, outperforming general-purpose LLMs, dedicated safety classifiers, and reasoning-oriented judges up to 30B in both peak accuracy and stability. An ablation shows that naively mixing dynamic rubrics into SFT increases cross rubric variance (1.44 -> 3.60); only the curriculum schedule recovers and improves on the fixed rubric baseline (variance 0.76).

HarmBench (ORG) SFT (ORG)

Originally published by arXiv CS Read original →

Reliable to Expressive: A Curriculum for Rubric-Following Safety Judges

Related Stories

School uniform charity plans fundraising week

Students' data taken in major university cyber-attack

Students' data taken in major university cyber-attack

Anthropic CEO Doesn’t Know If Claude Used in Iran School Strike