Benchmarking Study

Related Articles from SNS

When AI Benchmarks Plateau: A Systematic Study of Benchmark Saturation

arXiv:2602.16763v2 Announce Type: replace Abstract: Artificial intelligence benchmarks are an important mechanism for measuring model progress and guiding deployment decisions. However, benchmarks quickly "saturate", making it difficult to differentiate models and diminishing their long-term value. In this study, we define benchmark saturation and analyze it across 60 language model benchmarks using 14 properties that relate to saturation.

arXiv CS 8d ago

Impact of RTK Augmentation and INS Integration on GNSS Positioning Accuracy and Continuity: A Benchmarking Study on Inland Waterways

arXiv:2606.06358v1 Announce Type: new Abstract: RTK augmentation and INS integration are widely used to improve GNSS positioning performance. However, on inland waterways, bridges and surrounding structures can degrade satellite visibility and correction availability, causing RTK augmentation loss and GNSS/INS fusion transients.

arXiv CS 5d ago

Early Prediction of Liver Cirrhosis Up to Two Years in Advance: A Machine Learning Study Benchmarking Against the FIB-4 and APRI Scores

Announce Type: replace Abstract: Objective: Develop and evaluate machine learning (ML) models for predicting incident liver cirrhosis (LC) one and two years prior to diagnosis using routinely collected electronic health record (EHR) data and benchmark their performance against the FIB-4 and APRI clinical scores. Methods: We conducted a retrospective cohort study using de-identified EHR data from a large academic health system. XGBoost models were developed for 1- and 2-year prediction...

arXiv CS 8d ago

SpurAudio: A Benchmark for Studying Shortcut Learning in Few-Shot Audio Classification

arXiv:2605.13672v1 Announce Type: cross Abstract: Few-shot classification (FSC) is widely used for learning from limited labeled data, yet most evaluations implicitly assume that target concepts are independent of contextual cues. In real-world settings, however, examples often appear within rich contexts, allowing models to exploit spurious correlations between foreground content and background signals. While such effects have been studied in few-shot image classification, their role in...

arXiv CS 6d ago

A Baseline Study and Benchmark for Few-Shot Open-Set Action Recognition with Feature Residual Discrimination

arXiv:2603.04125v2 Announce Type: replace Abstract: Few-Shot Action Recognition (FS-AR) has shown promising results but is often limited by a closed-set assumption that fails in real-world open-set scenarios. While Few-Shot Open-Set (FSOS) recognition is well-established for images, its extension to spatio-temporal video data remains underexplored. To address this, we propose an architectural extension based on a Feature-Residual Discriminator (FR-Disc), adapting previous work on skeletal...

arXiv CS 1d ago

It's a TRAP! Task-Redirecting Agent Persuasion Benchmark for Web Agents

arXiv:2512.23128v2 Announce Type: replace Abstract: Web-based agents powered by large language models are increasingly used for tasks such as email management or professional networking. Their reliance on dynamic web content, however, makes them vulnerable to prompt injection attacks: adversarial instructions hidden in interface elements that persuade the agent to divert from its original task. We introduce the Task-Redirecting Agent Persuasion Benchmark (TRAP), a benchmark for studying how...

arXiv CS 2d ago

Declarative Outcome-Conformant Synthesis: Exact, Closed-Form Specification Satisfaction and a Conformance Benchmark

Announce Type: new Abstract: We study a capability the dominant paradigm in synthetic tabular data does not provide: exact satisfaction of a declared analytical outcome with no source data. Imitation methods (copulas, GANs, diffusion) learn a real distribution and sample from it, and are judged on fidelity to real data. A large, practical class of needs is different: generating data with no source data ("cold start") that reproduces a declared outcome (a revenue curve, a churn rate, a group...

arXiv CS 1d ago

Decomposing and Measuring Evaluation Awareness

Announce Type: replace Abstract: Frontier language models sometimes recognize that they are being evaluated and adjust their behavior, undermining validity of benchmark results. Yet the field studies it without a shared foundation, conflating properties of the evaluation with properties of the model, and detection with behavioral response. We ground evaluation awareness in social psychology, decomposing it into an environment component (how recognizable the task is) and a model component...

arXiv CS 7d ago

A Primer in Post-Training Reasoning Data: What We Know About How It Works

arXiv:2606.02113v1 Announce Type: new Abstract: Post-training has become a primary driver of recent progress in large reasoning models, and reasoning data are often the key variable determining whether this stage succeeds. Work on post-training reasoning data has grown rapidly, yet this literature remains scattered across dataset papers, reinforcement-learning recipes, reward-model studies, benchmarks, and frontier system reports. This paper is the first primer to synthesize over 150 key...

arXiv CS 8d ago

Analytic first-order non-adiabatic coupling matrix elements of spin-adapted open-shell time-dependent density functional theory

arXiv:2605.26594v2 Announce Type: replace Abstract: While spin-adapted time-dependent density functional theory (TDDFT) approaches significantly improve the excitation energies and gradients of open-shell molecules, the effect of spin-adaptation on non-adiabatic coupling matrix elements (NACMEs) remains unknown for spin-conserving excitations. In this article, we report the derivation, implementation and benchmark studies of the ground state-excited state and excited state-excited state...

arXiv Physics 6d ago