Evaluation
Related Articles from SNS
Who Evaluates AI's Social Impacts? Mapping Coverage and Gaps in First and Third Party Evaluations
arXiv:2511.05613v2 Announce Type: replace Abstract: Foundation models are increasingly central to high-stakes AI systems, and governance frameworks now depend on evaluations to assess their risks and capabilities. Although general capability evaluations are widespread, social impact assessments covering bias, fairness, privacy, environmental costs, and labor remain uneven. To characterize this landscape, we conduct the first comprehensive analysis of social impact evaluation reporting,...
Evaluation Cards: An Interpretive Layer for AI Evaluation Reporting
arXiv:2606.09809v1 Announce Type: new Abstract: AI evaluation results are produced at scale but reported inconsistently across leaderboards, model cards, benchmark papers, and company blogs. The cost is interpretive: readers cannot reliably compare results across sources, identify what a report omits, or trace an aggregate claim to its underlying evidence. Recent efforts address isolated components but leave three gaps: they cover only narrow slices of the evaluation lifecycle and do not...
Learning to Evaluate: Cost-Effective Model Evaluation on Unlabeled Data with Meta-Learning
Announce Type: replace Abstract: The rapid advancement of machine learning has led to an unprecedented expansion of model ecosystems, making it increasingly difficult to assess the reliability of newly released models on unseen and unlabeled data. Existing evaluation pipelines typically rely on costly annotation, repeated fine-tuning, or assumptions that do not generalize well to new models. We introduce MetaEvaluator, a cost-effective, model-agnostic framework for fast, label-free...
Train, Test, Re-evaluate: Schedule-Sensitive Evaluation of Generative Data for Hand Detection
arXiv:2606.01896v1 Announce Type: new Abstract: Generated (or synthetic) image data is increasingly used to augment or replace real training datasets when target imagery is scarce, expensive, or biased. For hand detection, particularly in occupational safety settings, public datasets mostly contain bare hands.
Benchmarking LLM-as-a-Judge for Long-Form Output Evaluation
arXiv:2606.01629v1 Announce Type: new Abstract: As large language models (LLMs) are increasingly used for long-form generation, reliably evaluating long-form outputs has become a critical challenge. LLM-as-a-judge offers a scalable alternative to human evaluation, yet its reliability in long-form output evaluation remains underexamined: existing meta-evaluation benchmarks focus mainly on short-form outputs. Compared with short-form evaluation, long-form evaluation is not merely a matter of...
HMRC Evaluation Framework
The framework sets out HMRC's evaluation approach and how it fits with wider government best practice. This framework was updated in 2026. The evaluation framework sets out our approach for achieving HMRC’s evaluation vision of good quality monitoring and evaluations of policies, programmes and projects in line with government good practice.
Explain Like I'm 5 or Whatever I Choose: Evaluating the Interactive Potential of Language Model Responses
arXiv:2606.06788v1 Announce Type: new Abstract: Evaluations of large language models (LLMs) in scientific information seeking tasks have become increasingly use-centric, such as conducting live or multi-turn evaluations with real users. These evaluations still assume a single, static chat interface, but as models are integrated into new interfaces, evaluations must shift to incorporate interface-specific criteria. We propose a new evaluation framework based on a formative study with $16$...
Query-efficient model evaluation using cached responses
Announce Type: replace Abstract: Evaluating a new model on an existing benchmark is often necessary to understand its behavior before deployment. For modern evaluation frameworks, generating and evaluating a response for all queries can be prohibitively expensive. In practice, responses from previously-evaluated models are often cached -- creating a potential opportunity to use this additional information to decrease the number of queries required to accurately evaluate a new model.