Evaluation

Related Articles from SNS

Who Evaluates AI's Social Impacts? Mapping Coverage and Gaps in First and Third Party Evaluations

arXiv:2511.05613v2. Abstract: Foundation models are increasingly central to high-stakes AI systems, and governance frameworks now depend on evaluations to assess their risks and capabilities. Although general capability evaluations are widespread, social impact assessments covering bias, fairness, privacy, environmental costs, and labor remain uneven. To characterize this landscape, we conduct the first comprehensive analysis of social impact evaluation reporting,...

arXiv CS 8d ago

Evaluation Cards: An Interpretive Layer for AI Evaluation Reporting

arXiv:2606.09809v1. Abstract: AI evaluation results are produced at scale but reported inconsistently across leaderboards, model cards, benchmark papers, and company blogs. The cost is interpretive: readers cannot reliably compare results across sources, identify what a report omits, or trace an aggregate claim to its underlying evidence. Recent efforts address isolated components but leave three gaps: they cover only narrow slices of the evaluation lifecycle and do not...

arXiv CS 1d ago

Learning to Evaluate: Cost-Effective Model Evaluation on Unlabeled Data with Meta-Learning

Abstract: The rapid advancement of machine learning has led to an unprecedented expansion of model ecosystems, making it increasingly difficult to assess the reliability of newly released models on unseen and unlabeled data. Existing evaluation pipelines typically rely on costly annotation, repeated fine-tuning, or assumptions that do not generalize well to new models. We introduce MetaEvaluator, a cost-effective, model-agnostic framework for fast, label-free...

arXiv CS 1d ago

Train, Test, Re-evaluate: Schedule-Sensitive Evaluation of Generative Data for Hand Detection

arXiv:2606.01896v1. Abstract: Generated (or synthetic) image data is increasingly used to augment or replace real training datasets when target imagery is scarce, expensive, or biased. For hand detection, particularly in occupational safety settings, public datasets mostly contain bare hands.

arXiv CS 8d ago

Benchmarking LLM-as-a-Judge for Long-Form Output Evaluation

arXiv:2606.01629v1. Abstract: As large language models (LLMs) are increasingly used for long-form generation, reliably evaluating long-form outputs has become a critical challenge. LLM-as-a-judge offers a scalable alternative to human evaluation, yet its reliability in long-form output evaluation remains underexamined: existing meta-evaluation benchmarks focus mainly on short-form outputs. Compared with short-form evaluation, long-form evaluation is not merely a matter of...

arXiv CS 8d ago

HMRC Evaluation Framework

The framework sets out HMRC's evaluation approach and how it fits with wider government best practice. This framework was updated in 2026. The evaluation framework sets out our approach for achieving HMRC’s evaluation vision of good quality monitoring and evaluations of policies, programmes and projects in line with government good practice.

GOV.UK Statistics 4d ago

Explain Like I'm 5 or Whatever I Choose: Evaluating the Interactive Potential of Language Model Responses

arXiv:2606.06788v1. Abstract: Evaluations of large language models (LLMs) in scientific information seeking tasks have become increasingly use-centric, such as conducting live or multi-turn evaluations with real users. These evaluations still assume a single, static chat interface, but as models are integrated into new interfaces, evaluations must shift to incorporate interface-specific criteria. We propose a new evaluation framework based on a formative study with $16$...

arXiv CS 2d ago

Query-efficient model evaluation using cached responses

Abstract: Evaluating a new model on an existing benchmark is often necessary to understand its behavior before deployment. For modern evaluation frameworks, generating and evaluating a response for all queries can be prohibitively expensive. In practice, responses from previously-evaluated models are often cached, creating a potential opportunity to use this additional information to decrease the number of queries required to accurately evaluate a new model.

arXiv CS 5d ago