Home Knowledge Base AI Rater

AI Rater

No mentions found

This entity hasn't been tracked yet, or Iris is still building its knowledge base.

Related Articles from SNS

AI Rater Discrimination Depends on Scoring Protocol in Complex Clinical Decision-Making

arXiv:2606.03198v1 Announce Type: new Abstract: Clinical AI evaluation increasingly delegates scoring to large language models (LLMs) acting as AI raters, yet their scoring behavior across evaluation conditions has not been quantitatively characterized. We address this gap through a factorial study of AI rater behavior in adult type 2 diabetes (T2D) pharmacotherapy at 12-month outpatient follow-up, a clinical task involving complex decision-making operationalized across seven evaluation...

arXiv CS 7d ago

Google just made you a search quality rater. You won't get paid

Google just made you a search quality rater. TL;DR AI is killing the click. Death of the click kills the incentive for links.

Hacker News 3d ago

ProEval: Proactive Failure Discovery and Efficient Performance Estimation for Generative AI Evaluation

Announce Type: replace Abstract: Evaluating generative AI models is increasingly resource-intensive due to slow inference, expensive raters, and a rapidly growing landscape of models and benchmarks. We propose ProEval, a proactive evaluation framework that leverages transfer learning to efficiently estimate performance and identify failure cases.

arXiv CS 7d ago

Contemporary AI lacks the imagination to diverge or negate in science

arXiv:2606.08251v1 Announce Type: new Abstract: Bold projections that artificial intelligence will accelerate scientific discovery have raced ahead of evidence from working scientists, and the field still lacks large-scale, scientist-in-the-loop tests of these claims. Here we mount the largest such evaluation to date and map what AI cannot yet do for science. We invited authors of 121,640 recent preprints across biology, medicine, chemistry, and the social sciences to judge follow-up ideas...

arXiv CS 1d ago

A Pathology Foundation Model for Gastric Cancer with Real-World Validation

arXiv:2606.04792v1 Announce Type: new Abstract: Gastric cancer remains a major cause of cancer mortality, yet its histological and molecular heterogeneity complicates diagnosis and risk stratification. General-purpose pathology foundation models (PFMs) often plateau on fine-grained endpoints central to gastric cancer care, and few have undergone rigorous prospective validation or clinical reader studies. We present GRACE, a Gastric-specific foundation model for Real-world Assessment and...

arXiv CS 6d ago

TeachObs: A Human-Validated Benchmark for Multimodal Teaching Observation and Model Evaluation

arXiv:2605.30673v1 Announce Type: new Abstract: Classroom videos contain observable teaching practices, but their pedagogical and visual signals are rarely organized in forms suitable for model evaluation. We present \textit{TeachObs}, a human-validated benchmark for multimodal teaching observation in classroom videos. \textit{TeachObs} includes 30 public lesson videos from eight countries divided into 5,158 fixed 15-second scenes.

arXiv CS 9d ago

Counterfactual Evaluation Reveals Hidden Capability Profiles in Clinical LLMs and Agents

Announce Type: new Abstract: Two clinical AI systems can score nearly identically on coverage-based rubrics yet behave radically differently when their patient inputs change: one updates its recommendations to match the new clinical signal, while the other produces the same output regardless. We introduce the Causal Sensitivity Score (CSS), a pre-registered interventional metric that mutates oncology tumor-board cases along five clinically meaningful dimensions - biomarker flips,...

arXiv CS 9d ago

BenGER: Benchmarking LLM Systems on Subsumption-Based Legal Reasoning in German Law

arXiv:2605.28183v2 Announce Type: replace Abstract: We introduce the BenGER (Benchmark for German Law) dataset for evaluating LLM systems on subsumption-based legal reasoning in German law. The BenGER dataset consists of three components: 596 exam-style free-text legal case tasks across multiple levels of legal education and 531 short doctrinal reasoning tasks. We evaluate 12 contemporary LLM systems -- closed flagship, efficiency-oriented, and open-weight -- across automatic and judge-based...

arXiv CS 8d ago

Building Customer Support AI Agents at 100M-User Scale: An Evaluation-Driven Framework

arXiv:2606.08867v1 Announce Type: new Abstract: The rapid rise in LLM capabilities has made AI agents increasingly viable across a broad range of tasks. Among the most promising applications is building production-ready customer-facing agents, a challenge that demands coordinated excellence in evaluation methodology, context engineering, training, and online measurement.

arXiv CS 1d ago

Framing, Judging, Steering: An Assessable Competency Model for Teach-ing Students to Reason With Generative AI

arXiv:2606.05983v1 Announce Type: new Abstract: Generative AI makes answers easy and understanding hard, and uncritical use invites cognitive offloading. Schools still measure unaided performance, yet the real task is to produce good work with AI: framing an ill-defined task, judging the output, and steering the model toward a better result. This ability is rarely assessed in its own right; where measured, it collapses into one "prompting" score that cannot diagnose why AI use succeeds or fails.

arXiv CS 5d ago