Reliability through Evaluation Transparency

Related Articles from SNS

Illusions of the Gold Standard: A Large-scale Analysis of Human Evaluation Protocols for Long-form Text Generation

arXiv:2606.07936v1 Announce Type: new Abstract: Human evaluation plays a critical role in assessing the quality of generated text. However, the reliability and reproducibility of these evaluations depend on transparent and well-documented protocols -- details that are frequently missing in current practice. In this work, we conduct a large-scale analysis of human evaluation protocols for evaluating long-form generation tasks in *CL conference publications from 2023--2025, including a full...

arXiv CS 1d ago

Lessons from the Trenches on Reproducible Evaluation of Language Models

arXiv:2405.14782v3 Announce Type: replace Abstract: Reliable evaluation of language models (LMs) remains an open challenge. Researchers and engineers face methodological issues such as the sensitivity of models to evaluation setup, difficulty of proper comparisons across methods, and the lack of reproducibility and transparency. Evaluation difficulties are exacerbated by the fracturing and siloing of information about conventions and common practices.

arXiv CS 8d ago

China launches AI framework to improve ‘black box’ transparency and raise standards

The initiative underscores Beijing’s growing focus on AI governance, as concerns grow over algorithm bias and data security. China has pledged to improve the accuracy, reliability and transparency of AI through a new national evaluation framework, as policymakers move to establish common standards for assessing the fast-evolving technology. New guidelines released by the central government said Beijing would...

South China Morning Post 11d ago

iML: Executable, Problem-Grounded, and Broadly Exploratory Code-Driven AutoML

Announce Type: replace Abstract: Automated Machine Learning (AutoML) has improved access to machine learning, yet existing techniques often remain limited in flexibility, transparency, and execution reliability. Code-driven AutoML offers a promising direction by synthesizing executable code for preprocessing, model training, and evaluation. However, current LLM-based approaches frequently generate code that is plausible in text yet brittle in execution, insufficiently grounded in the actual...

arXiv CS 8d ago

Pitfalls of Evaluating Language Models with Open Benchmarks

arXiv:2507.00460v3 Announce Type: replace Abstract: Open Large Language Model (LLM) benchmarks, such as HELM and BIG-Bench, provide standardized and transparent evaluation protocols that support comparative analysis, reproducibility, and systematic progress tracking in Language Model (LM) research. Yet, this openness also creates substantial risks of data leakage during LM testing--deliberate or inadvertent, thereby undermining the fairness and reliability of leaderboard rankings and leaving...

arXiv CS 5d ago

Perspective on Bias in Biomedical AI: Preventing Downstream Healthcare Disparities

Announce Type: replace Abstract: Healthcare disparities persist across socioeconomic boundaries, often attributed to unequal access to screening, diagnostics, and therapeutics. However, this perspective highlights that critical biases can emerge much earlier, during data collection and research prioritization, long before clinical implementation, particularly in studies focused on molecular and omics data. A vast number of studies focus on collecting omics data, but the demographic...

arXiv CS 8d ago

The Reliability Gap in Benchmark Auditing: Distribution Shift and Scale as Failure Modes of Contamination Detection

arXiv:2606.03305v1 Announce Type: new Abstract: Benchmark contamination, where evaluation examples appear in a model's training data, threatens the validity of LLM assessment. Statistical tools for detecting training-data membership exist, but have been validated almost exclusively in controlled academic regimes: large, homogeneous pre-training corpora and transparent, single-stage training pipelines. Whether these methods remain reliable in realistic auditing scenarios remains unclear.

arXiv CS 7d ago

A Methodological Framework for Explicit Control of the Speed-Accuracy Trade-off in Brain-Computer Interfaces

Announce Type: cross Abstract: Brain-computer interfaces (BCIs) are limited by low signal-to-noise ratio in modalities such as electroencephalography, which requires multiple trials to reliably decode user intentions. This induces a speed-accuracy trade-off, whereby higher accuracy comes at the cost of speed. The speed-accuracy balance is application-dependent, motivating controllable trade-offs.

arXiv CS 8d ago

BIAS-ID: A Framework for Analyzing Transformation Biases in AI-Generated Image Detectors

arXiv:2605.31153v1 Announce Type: new Abstract: Given the surge of harmful AI-generated imagery online, reliably distinguishing authentic images from generated ones has become an urgent research topic. While many proposed detection methods perform well under controlled settings, they often collapse when tested on real-world data.

arXiv CS 9d ago

Cab-less electric trucks hit Ohio roads

A freight truck with no driver, no cab and no one sitting behind the wheel is starting to sound more familiar. In fact, this summer, that is exactly what is happening on local roads in Marysville, Ohio. EASE Logistics, an Ohio-based logistics company, is partnering with autonomous truck technology company Einride to deploy two cab-less electric trucks between EASE warehouse locations. The two companies recently announced the proof-of-concept service. The trucks will operate on EASE property...

Fox News Tech 13d ago