Benchmarking Robots

Related Articles from SNS

RoboBenchMart: Benchmarking Robots in Retail Environment

Abstract: Most existing robotic manipulation benchmarks focus on tabletop or household scenarios. While these setups have driven impressive progress, it remains unclear whether generalist VLAs that excel there can truly generalize to domains with different geometry, semantics, and workflows. We introduce RoboBenchMart, an open-source simulated benchmark targeting retail dark-store environments, where a mobile manipulator must perform complex manipulation tasks with...

arXiv CS 8d ago

What Are We Actually Benchmarking in Robot Manipulation?

arXiv:2606.04233v1. Abstract: A robotics benchmark score measures success under one fixed evaluation setup, yet is routinely treated as evidence of general manipulation capability. We identify four failure modes, each of which weakens or invalidates a benchmark's role as a valid proxy for that capability: shortcut solvability, lack of statistical significance, creeping overfitting, and data-source dependence.

arXiv CS 6d ago

Assistax: A Multi-Agent Hardware-Accelerated Reinforcement Learning Benchmark for Assistive Robotics

arXiv:2507.21638v2. Abstract: The development of reinforcement learning (RL) algorithms has been largely driven by ambitious challenge tasks and benchmarks. Games have dominated RL benchmarks because they present relevant challenges, are inexpensive to run and easy to understand. While games such as Go and Atari have led to many breakthroughs, they often do not directly translate to real-world embodied applications.

arXiv CS 7d ago

RoboTrustBench: Benchmarking the Trustworthiness of Video World Models for Robotic Manipulation

arXiv:2606.01600v1. Abstract: Video world models are increasingly used in robotic manipulation, yet existing benchmarks mostly evaluate them under valid, feasible, and safe instructions. We introduce RoboTrustBench, a benchmark for evaluating the trustworthiness of video world models under four scenarios: Normal, Constraint-Sensitive, Counterfactual, and Adversarial. Built from real-world DROID episodes, RoboTrustBench contains 1,207 expert-validated instruction-image pairs...

arXiv CS 8d ago

RAD: A Dataset and Benchmark for Real-Life Anomaly Detection with Robotic Observations

arXiv:2410.00713v4. Abstract: Anomaly detection is a core capability for robotic perception and industrial inspection, yet most existing benchmarks are collected under controlled conditions with fixed viewpoints and stable illumination, failing to reflect real deployment scenarios. We introduce RAD (Realistic Anomaly Detection), a robot-captured, multi-view dataset designed to stress pose variation, reflective materials, and viewpoint-dependent defect visibility. RAD...

arXiv CS 1d ago

RobotValues: Evaluating Household Robots When Human Values Conflict

Abstract: While household robots are often evaluated based on task completion, everyday domestic environments involve value-conflicting situations in which robots are expected to choose actions that prioritize other values than task success, such as human autonomy, efficiency, or social appropriateness. Yet, there are no benchmarks for evaluating robots' value preferences in such scenarios. We introduce RobotValues, a benchmark to evaluate household robot planners in 10K...

arXiv CS 7d ago

Benchmarking Vision-Language-Action Models on SO-101: Failure and Recovery Analysis

arXiv:2606.08881v1. Abstract: Vision-Language-Action (VLA) models have demonstrated strong generalization in robotic manipulation, yet existing evaluations are primarily conducted in simulation or on expensive robotic platforms, leaving their robustness on affordable real-world robots largely unexplored. We present a standardized real-world benchmark for evaluating representative VLA and imitation learning policies on the low-cost SO-101 robotic platform. The benchmark...

arXiv CS 1d ago

SIMPLE: Simulation-Based Policy Learning and Evaluation for Humanoid Loco-manipulation

Abstract: Humanoid foundation models are advancing faster than we can evaluate them. While real-world testing is expensive and difficult to reproduce, existing simulation benchmarks focus primarily on table-top or wheeled robots. A scalable and reproducible benchmark for whole-body humanoid loco-manipulation remains an open problem.

arXiv CS 1d ago

IR-SIM: A Lightweight Skill-Native Simulator for Navigation, Learning, and Benchmarking

arXiv:2606.08729v1. Abstract: Simulation plays a key role in automated robotics research supported by large language models (LLMs). However, existing simulators often require custom code or complex interfaces, creating a barrier to rapid prototyping and automated algorithm development. To this end, we propose the Intelligent Robot Simulator (IR-SIM), a lightweight skill-native navigation simulator designed for rapid scenario construction, benchmarking, and robot learning.

arXiv CS 1d ago

NextMotionQA: Benchmarking and Judging Human Motion Understanding with Vision-Language Models

arXiv:2606.04773v1. Abstract: Reliable evaluation of human motion understanding is fundamental to advancing embodied AI, robotics, and animation. However, existing benchmarks suffer from coarse semantic granularity, undifferentiated difficulty, limited annotation quality, and pervasive answer ambiguity, leaving them unable to diagnose where current models fail. To bridge this gap, we introduce NextMotionQA, a comprehensive benchmark that leverages vision-language models...

arXiv CS 6d ago