A Comprehensive Benchmark

Related Articles from SNS

MBench: A Comprehensive Benchmark on Memory Capability for Video World Models

arXiv:2606.00793v2 Announce Type: replace Abstract: Recent advancements in video-based world models have demonstrated an unprecedented ability to synthesize high-fidelity visual sequences. However, a fundamental gap persists between visually plausible video generation and the functional requirements of a world model, particularly in maintaining a stable and reasonable internal state over extended temporal horizons. While existing benchmarks primarily emphasize visual quality, motion...

arXiv CS 1d ago

scTranslation: A Comprehensive Benchmark for Single-Cell Multi-Omics Modality Translation

Announce Type: new Abstract: Simultaneous measurement of multiple omics modalities in single cells enables researchers to gain a more comprehensive understanding of cellular states and regulatory mechanisms. However, due to high experimental costs, significant noise, and incomplete modality coverage, a variety of computational methods for modality translation have emerged in recent years. Despite the development of translation models, there is still a lack of systematic benchmark evaluation in terms of...

arXiv CS 7d ago

TravelEval: A Comprehensive Benchmarking Framework for Evaluating LLM-Powered Travel Planning Agents

arXiv:2606.01046v1 Announce Type: new Abstract: The development of Large Language Models (LLMs) has significantly improved travel planning applications, yet evaluating such models is limited by existing benchmarks' limitations: 1) overemphasis on constraint compliance, neglecting multi-dimensional qualities like spatio-temporal cost; 2) datasets lacking real-world authenticity and coverage in key areas (e.g., lodging, transport); and 3) isolated daily plan assessments that miss critical...

arXiv CS 8d ago

GraphARC: A Comprehensive Benchmark for Graph-Based Abstract Reasoning

arXiv:2605.31031v1 Announce Type: new Abstract: Relational reasoning lies at the heart of intelligence, but existing benchmarks are typically confined to formats such as grids or text. We introduce GraphARC, a benchmark for abstract reasoning on graph-structured data. GraphARC generalizes the few-shot transformation learning paradigm of the Abstraction and Reasoning Corpus (ARC).

arXiv CS 9d ago

Dr. DocBench: A Comprehensive Benchmark for Expert-Level and Difficult Document Parsing

arXiv:2606.01393v1 Announce Type: new Abstract: Document parsing and recognition are fundamental capabilities for vision-language models (VLMs) and document processing systems. However, existing Optical Character Recognition (OCR) and document parsing benchmarks are increasingly limited in coverage and difficulty: many focus on common document genres or uniformly sampled pages where modern parsers already perform strongly, while offering limited annotation for expert-domain structures such...

arXiv CS 8d ago

CoQuIR: A Comprehensive Benchmark for Code Quality-Aware Information Retrieval

arXiv:2506.11066v3 Announce Type: replace Abstract: Code retrieval is essential in modern software development, as it boosts code reuse and accelerates debugging. However, current benchmarks primarily emphasize functional relevance while neglecting critical dimensions of software quality. Motivated by this gap, we introduce CoQuIR, the first large-scale, multilingual benchmark specifically designed to evaluate quality-aware code retrieval across four key dimensions: correctness, efficiency,...

arXiv CS 2d ago

TransportBench: A Comprehensive Benchmark for Non-Equilibrium Flow Transport

Announce Type: new Abstract: Scientific machine learning models, as versatile tools for numerical simulation and analysis, are increasingly transforming the landscape of fluid mechanics research. However, existing datasets and benchmarks are primarily limited to continuum fluids and provide limited support for non-equilibrium transport phenomena. To address this gap, we present TransportBench, a high-fidelity dataset and standardized benchmark for non-equilibrium flow transport, designed to...

arXiv Physics 7d ago

Rethinking Weak Supervision in Anomaly Detection: A Comprehensive Benchmark

Announce Type: replace Abstract: Weakly supervised anomaly detection (WSAD) has developed in three primary directions: incomplete, inexact, and inaccurate supervision. However, these directions remain isolated, lacking a unified framework to assess whether they address unique challenges or share fundamental mechanisms. This paper introduces WSADBench, the first benchmark that unifies evaluation across distinct weakly supervised scenarios, benchmarking diverse approaches from specialized WSAD...

arXiv CS 8d ago

ERGeoBench: A Comprehensive Benchmark for Embodied Reasoning and Geo-localization in Multimodal Large Language Models

arXiv:2605.31251v1 Announce Type: new Abstract: Multimodal large language models (MLLMs) have shown strong potential as embodied agents, yet embodied geo-localization remains underexplored due to the lack of fine-grained evaluation. We introduce ERGeoBench, a diagnostic benchmark for vision-driven embodied geo-localization. ERGeoBench evaluates models under three progressive settings -- single-view, panorama-view, and embodied-view -- where agents may actively acquire observations through...

arXiv CS 9d ago

V2V-Bench: A Comprehensive Benchmark for Video-to-Video Generation Evaluation

Announce Type: new Abstract: Video-to-video (V2V) generation is difficult to evaluate because outputs must both follow editing instructions and preserve frame-level correspondence with the source video, which existing T2V and I2V metrics do not capture. We introduce V2V-Bench, an 11-dimension benchmark organized into five categories: temporal alignment, structural fidelity, transformation quality, video quality, and semantic alignment. V2V-Bench pairs diverse source videos with challenging...

arXiv CS 5d ago