Bench

Related Articles from SNS

Continual Learning Bench: Evaluating Frontier AI Systems in Real-World Stateful Environments

arXiv:2606.05661v1. Continual learning, the ability of AI systems to improve through sequential experience, has attracted substantial interest, but no high-quality benchmark exists to evaluate it. We introduce Continual Learning Bench (CL-Bench), the first difficult, expert-validated benchmark designed to measure whether LLM-based systems genuinely improve with experience.

arXiv CS 5d ago

CodeGolf Bench: A Multi-Language Benchmark for Evaluating Concise Code Generation Capabilities of Large Language Models

This paper introduces CodeGolf Bench, a benchmark capable of evaluating Large Language Models' (LLMs') concise code generation abilities in 60 programming languages. Based on code golf, a recreational programming competition focused on minimal character or byte solutions, the benchmark provides a distinctive measure of LLMs' ability to produce efficient, concise code. Unlike existing benchmarks limited by fixed problem sets and language coverage, CodeGolf Bench...

arXiv CS 9d ago

V2V-Bench: A Comprehensive Benchmark for Video-to-Video Generation Evaluation

Video-to-video (V2V) generation is difficult to evaluate because outputs must both follow editing instructions and preserve frame-level correspondence with the source video, which existing T2V and I2V metrics do not capture. We introduce V2V-Bench, an 11-dimension benchmark organized into five categories: temporal alignment, structural fidelity, transformation quality, video quality, and semantic alignment. V2V-Bench pairs diverse source videos with challenging...

arXiv CS 5d ago

TadA-Bench: A Million-Variant Benchmark for Future-Round Discovery Toward Agentic Protein Engineering

AI for scientific discovery is entering an agentic era, where protein-engineering systems are expected to prioritize future wet-lab experiments rather than merely fit static measurements. We introduce TadA-Bench, a million-variant wet-lab replay benchmark from 31 TadA directed-evolution rounds for future-round discovery toward agentic protein engineering. TadA-Bench preserves the campaign chronology and defines a fixed-data replay task: given earlier...

arXiv CS 7d ago

CRAB-Bench: Evaluating LLM Agents under Complex Task Dependencies and Human-aligned User Simulation

Evaluating LLM agents in realistic service scenarios requires complex task dependencies, imperfect user behavior, and an evaluation that accommodates multiple valid solutions. We introduce CRAB-Bench (Constraint-based Realistic Agent Benchmark) and RUSE (Realistic User Simulation Engine) to address this gap. CRAB-Bench generates tasks via a constraint graph over multiple interdependent entities with structured distractors, requiring agents to reason carefully...

arXiv CS 8d ago

Timber on the bench, Havertz starts for Arsenal against PSG in Champions League final

BUDAPEST, May 30: Arsenal full back Jurrien Timber, who has been out since March with a groin injury but was declared fit again, will not start Saturday’s Champions League final against Paris St Germain, while Kai Havertz has been picked to start as the lone striker. Timber is on the bench with Cristhian Mosquera starting at right back. The 19-year-old Myles Lewis-Skelly also starts in midfield in place of...

Channel News Asia 10d ago

Messi comes off the bench to score in Argentina’s final World Cup warm-up

Argentina ease past Iceland in their final friendly before the World Cup, winning 3-0 in Auburn, Alabama. Lionel Messi came off the bench and scored a penalty as Argentina wrapped up their World Cup preparations with a comfortable 3-0 victory over Iceland in Auburn, in the US state of Alabama. Messi came on in the 70th minute and set up a penalty kick with his first touch of the match, before converting the spot kick...

Al Jazeera 5h ago

TRL-Bench: Standardizing Cross-Paradigm Representation-Level Evaluation of Tabular Encoders

arXiv:2606.09323v1. Tabular encoders are usually evaluated inside task-specific end-to-end pipelines, so models from different training paradigms are difficult to compare directly even when they operate on similar tabular signals. We introduce TRL-Bench, a multi-granular tabular representation learning (TRL) benchmark that standardizes cross-paradigm representation-level evaluation: each encoder exports row-, column-, or table embeddings through its supported...

arXiv CS 1d ago

$\Psi$-Bench: Evaluating Persona-Sensitive Influencing in Persuasive Dialogues

Personalization is a crucial capability of modern language agents. However, current research primarily positions personalized agents as passive responders to user preferences, limiting their ability to interact with users and provide suggestions or guidance proactively. To systematically evaluate such proactive personalization in realistic interactions, we propose $\Psi$-Bench, a benchmark for assessing LLMs' ability to influence realistic users through conversation.

arXiv CS 7d ago

LL-Bench: Rethinking Low-Level Vision Evaluation in the Era of Large-Scale Generative Models

Large-scale generative models have demonstrated remarkable capabilities across image generation and editing tasks. However, their performance in low-level vision tasks, which require pixel-wise control, remains insufficiently studied. To address this gap, we introduce LL-Bench, a comprehensive benchmark for evaluating the capabilities of large-scale generative models on Low-Level vision tasks.

arXiv CS 8d ago