Equipping Vision-Language-Action

Related Articles from SNS

MemoryVLA++: Temporal Modeling via Memory and Imagination in Vision-Language-Action Models

Abstract: Temporal modeling is essential for robotic manipulation, as effective control requires both memory of past interactions and imagination of future states. However, most VLA models rely primarily on the current observation and therefore struggle with long-horizon, temporally dependent tasks. Cognitive science suggests that humans rely on working memory to buffer short-lived context, the hippocampal system to preserve episodic memory of past experience, and internal...

arXiv CS 1d ago

HapTile: A Haptic-Informed Vision-Tactile-Language-Action Dataset for Contact-Rich Imitation Learning

arXiv:2606.04825v1. Abstract: Despite the importance of tactile sensing for reliable manipulation, most existing Vision-Language-Action (VLA) datasets remain vision-only, and those that do incorporate tactile information typically lack the joint combination of task diversity, language conditioning, and action trajectories. Furthermore, existing teleoperation pipelines rarely provide haptic feedback to the operator, despite its established role in demonstration quality and...

arXiv CS 6d ago

X-Foresight: A Joint Vision-Action Causal Forecasting Network via Predictive World Modeling

arXiv:2605.24892v2. Abstract: Physical world knowledge resides mainly in videos. Equipping Vision-Language-Action (VLA) models with such knowledge is fundamental for safe and generalizable planning. Predictive world modeling enables VLA to internalize physical dynamics and long-term causality by predicting future video from past observations.

arXiv CS 8d ago

GEAR-VLA: Learning Geometry-Aware Action Representations for Generalizable Robotic Manipulation

Abstract: Vision-Language-Action (VLA) models achieve strong benchmark performance but still struggle in real-world deployment with unseen objects, background shifts, and different robot embodiments. We argue that this stems from the lack of a unified geometry-aware manipulation representation, leaving existing VLAs vulnerable to low-level trajectory supervision, misaligned 3D features, and embodiment differences. To address this, we propose GEAR-VLA, a VLA framework for...

arXiv CS 1d ago