AI Rater
No mentions found
This entity hasn't been tracked yet, or Iris is still building its knowledge base.
Related Articles from SNS
AI Rater Discrimination Depends on Scoring Protocol in Complex Clinical Decision-Making
arXiv:2606.03198v1 Announce Type: new Abstract: Clinical AI evaluation increasingly delegates scoring to large language models (LLMs) acting as AI raters, yet their scoring behavior across evaluation conditions has not been quantitatively characterized. We address this gap through a factorial study of AI rater behavior in adult type 2 diabetes (T2D) pharmacotherapy at 12-month outpatient follow-up, a clinical task involving complex decision-making operationalized across seven evaluation...
Google just made you a search quality rater. You won't get paid
Google just made you a search quality rater. TL;DR AI is killing the click. Death of the click kills the incentive for links.
ProEval: Proactive Failure Discovery and Efficient Performance Estimation for Generative AI Evaluation
Announce Type: replace Abstract: Evaluating generative AI models is increasingly resource-intensive due to slow inference, expensive raters, and a rapidly growing landscape of models and benchmarks. We propose ProEval, a proactive evaluation framework that leverages transfer learning to efficiently estimate performance and identify failure cases.
Contemporary AI lacks the imagination to diverge or negate in science
arXiv:2606.08251v1 Announce Type: new Abstract: Bold projections that artificial intelligence will accelerate scientific discovery have raced ahead of evidence from working scientists, and the field still lacks large-scale, scientist-in-the-loop tests of these claims. Here we mount the largest such evaluation to date and map what AI cannot yet do for science. We invited authors of 121,640 recent preprints across biology, medicine, chemistry, and the social sciences to judge follow-up ideas...
A Pathology Foundation Model for Gastric Cancer with Real-World Validation
arXiv:2606.04792v1 Announce Type: new Abstract: Gastric cancer remains a major cause of cancer mortality, yet its histological and molecular heterogeneity complicates diagnosis and risk stratification. General-purpose pathology foundation models (PFMs) often plateau on fine-grained endpoints central to gastric cancer care, and few have undergone rigorous prospective validation or clinical reader studies. We present GRACE, a Gastric-specific foundation model for Real-world Assessment and...
TeachObs: A Human-Validated Benchmark for Multimodal Teaching Observation and Model Evaluation
arXiv:2605.30673v1 Announce Type: new Abstract: Classroom videos contain observable teaching practices, but their pedagogical and visual signals are rarely organized in forms suitable for model evaluation. We present \textit{TeachObs}, a human-validated benchmark for multimodal teaching observation in classroom videos. \textit{TeachObs} includes 30 public lesson videos from eight countries divided into 5,158 fixed 15-second scenes.
Counterfactual Evaluation Reveals Hidden Capability Profiles in Clinical LLMs and Agents
Announce Type: new Abstract: Two clinical AI systems can score nearly identically on coverage-based rubrics yet behave radically differently when their patient inputs change: one updates its recommendations to match the new clinical signal, while the other produces the same output regardless. We introduce the Causal Sensitivity Score (CSS), a pre-registered interventional metric that mutates oncology tumor-board cases along five clinically meaningful dimensions - biomarker flips,...
BenGER: Benchmarking LLM Systems on Subsumption-Based Legal Reasoning in German Law
arXiv:2605.28183v2 Announce Type: replace Abstract: We introduce the BenGER (Benchmark for German Law) dataset for evaluating LLM systems on subsumption-based legal reasoning in German law. The BenGER dataset consists of three components: 596 exam-style free-text legal case tasks across multiple levels of legal education and 531 short doctrinal reasoning tasks. We evaluate 12 contemporary LLM systems -- closed flagship, efficiency-oriented, and open-weight -- across automatic and judge-based...
Building Customer Support AI Agents at 100M-User Scale: An Evaluation-Driven Framework
arXiv:2606.08867v1 Announce Type: new Abstract: The rapid rise in LLM capabilities has made AI agents increasingly viable across a broad range of tasks. Among the most promising applications is building production-ready customer-facing agents, a challenge that demands coordinated excellence in evaluation methodology, context engineering, training, and online measurement.
Framing, Judging, Steering: An Assessable Competency Model for Teach-ing Students to Reason With Generative AI
arXiv:2606.05983v1 Announce Type: new Abstract: Generative AI makes answers easy and understanding hard, and uncritical use invites cognitive offloading. Schools still measure unaided performance, yet the real task is to produce good work with AI: framing an ill-defined task, judging the output, and steering the model toward a better result. This ability is rarely assessed in its own right; where measured, it collapses into one "prompting" score that cannot diagnose why AI use succeeds or fails.