Home › Knowledge Base › Benchmarking Vision-Language-Action Models

Benchmarking Vision-Language-Action Models

No mentions found

This entity hasn't been tracked yet, or Iris is still building its knowledge base.

Related Articles from SNS

QuoVLA: Quotient Space for Vision-Language-Action Models

Announce Type: replace Abstract: Vision-Language-Action (VLA) models commonly adapt pretrained Vision-Language Models (VLMs) to robot control by mapping visual observations and language instructions to continuous actions. Existing approaches typically take an action-insufficiency view, assuming that pretrained VLM latents either lack directly usable action information or should be shielded from action-learning signals. Against this view, our \textit{Quotient Theory for VLA} shows that...

arXiv CS 1d ago

VLA-Arena: An Open-Source Framework for Benchmarking Vision-Language-Action Models

Announce Type: replace Abstract: While Vision-Language-Action models (VLAs) are rapidly advancing towards generalist robot policies, it remains difficult to quantitatively understand their limits and failure modes. To address this, we introduce a comprehensive benchmark called VLA-Arena. We propose a novel structured task design framework to quantify difficulty across three orthogonal axes: (1) Task Structure, (2) Language Command, and (3) Visual Observation.

arXiv CS 7d ago

CrossVLA: Cross-Paradigm Post-Training and Inference Optimization for Vision-Language-Action Models

arXiv:2605.21854v2 Announce Type: replace Abstract: Vision-Language-Action (VLA) models have rapidly converged on a small set of architectural patterns: discrete-token autoregression (e.g. OpenVLA) and continuous-action flow-matching (e.g. pi-0.5). Yet preference alignment via Direct Preference Optimisation (DPO) -- the de-facto post-training step in language models -- has been studied almost exclusively on autoregressive VLAs. We present CrossVLA, an empirical study of cross-paradigm VLA...

arXiv CS 1d ago

FATE-VLA:Failue-aware test generation for vision-language-action models

arXiv:2606.02307v1 Announce Type: new Abstract: Vision-Language-Action (VLA) models are increasingly used as generalist robot policies, yet their evaluation still relies largely on static benchmarks that randomly sample task scenes. In high-dimensional embodied spaces, failures are sparse and clustered, so static benchmarking can underestimate robustness risks. We reframe VLA evaluation as an active failure-discovery problem and propose a failure-aware test-generation approach that combines...

arXiv CS 8d ago

ImagineUAV: Aerial Vision-Language Navigation via World-Action Modeling and Kinodynamic Planning

Announce Type: new Abstract: Vision-language navigation (VLN) for UAVs demands grounding free-form instructions into 6-DoF flight under partial observability. While Vision-Language-Action (VLA) models excel at semantic reasoning, they suffer from brittleness due to geometric inconsistency and dynamics mismatch. To address this, we propose ImagineUAV, an imagination-driven framework leveraging cascaded world-action modeling.

arXiv CS 8d ago

VLM4VLA: Revisiting Vision-Language-Models in Vision-Language-Action Models

Announce Type: replace Abstract: Vision-Language-Action (VLA) models, which integrate pretrained large Vision-Language Models (VLM) into their policy backbone, are gaining significant attention for their promising generalization capabilities. This paper revisits a fundamental yet seldom systematically studied question: how VLM choice and competence translate to downstream VLA policies performance? We introduce VLM4VLA, a minimal adaptation pipeline that converts general-purpose VLMs into VLA...

arXiv CS 8d ago

MemoryVLA++: Temporal Modeling via Memory and Imagination in Vision-Language-Action Models

Announce Type: new Abstract: Temporal modeling is essential for robotic manipulation, as effective control requires both memory of past interactions and imagination of future states. However, most VLA models rely primarily on the current observation and therefore struggle with long-horizon, temporally dependent tasks. Cognitive science suggests that humans rely on working memory to buffer short-lived context, the hippocampal system to preserve episodic memory of past experience, and internal...

arXiv CS 1d ago

See, Plan, Rewind: Progress-Aware Vision-Language-Action Models for Robust Robotic Manipulation

arXiv:2603.09292v2 Announce Type: replace Abstract: Measurement of task progress through explicit, actionable milestones is critical for robust robotic manipulation. This progress awareness enables a model to ground its current task status, anticipate verifiable intermediate states, and detect and recover from failures when progress stalls. To embody this capability, we introduce \textbf{S}ee, \textbf{P}lan, \textbf{R}ewind (SPR), a progress-aware vision-language-action framework that...

arXiv CS 8d ago

DeMaVLA: A Vision-Language-Action Foundation Model for Generalizable Deformable Manipulation

arXiv:2605.31286v1 Announce Type: new Abstract: Real-world household robots require Vision-Language-Action (VLA) foundation models that can acquire reusable manipulation skills across diverse objects, task conditions, and household environments. Deformable-object folding is a representative challenge, requiring robots to handle clothing items from random initial states across varying categories, geometries, materials, and scenes. However, existing VLA systems commonly train separate policies...

arXiv CS 9d ago

WorldFly: A World-Model-Based Vision-Language-Action Model for UAV Navigation

arXiv:2606.06147v1 Announce Type: new Abstract: End-to-end Vision-Language-Action (VLA) models have shown promise in UAV navigation. However, existing approaches typically rely on historical observations to directly predict actions, often struggling in dense urban environments where severe occlusions and sharp turns result in drastic viewpoint transitions. We argue that the ability to "imagine" future states -- inherent in World Models -- is critical for robust decision-making under such...

arXiv CS 5d ago