Semantic Grounding in Action Prediction

Related Articles from SNS

RoboSemanticBench: Diagnosing Semantic Grounding in Action Prediction for VLA Models

Announce Type: new Abstract: Vision-language-action (VLA) models are built on the premise that semantic understanding from pretrained language or vision-language backbones should guide robot action prediction. Yet robot fine-tuning is optimized as imitation over task-specific action distributions, and many evaluations can be solved through visual or instruction-action shortcuts. We introduce RoboSemanticBench (RSB), an embodied benchmark for diagnosing semantic grounding in action...

arXiv CS 8d ago

Semantic Partial Grounding via LLMs

arXiv:2602.22067v2 Announce Type: replace Abstract: Grounding is a critical step in classical planning, yet it often becomes a computational bottleneck due to the exponential growth in grounded actions and atoms as task size increases. Recent advances in partial grounding have addressed this challenge by incrementally grounding only the most promising operators, guided by predictive models. However, these approaches primarily rely on relational features or learned embeddings and do not...

arXiv CS 5d ago

WALL-WM: Carving World Action Modeling at the Event Joints

arXiv:2606.01955v1 Announce Type: new Abstract: WALL-WM is a World Action Model that shifts video-action learning from chunk-centric optimization to event-grounded Vision-Language-Action pretraining, using semantically coherent action events as the atomic unit of learning. Existing WAMs commonly initialize from multimodal or video foundation models and then optimize fixed-length action chunks conditioned directly on the current observation and instruction. Although convenient, this...

arXiv CS 8d ago

GeoAlign: Beyond Semantics with State-Guided Spatial Alignment in VLA Models

arXiv:2606.03240v1 Announce Type: new Abstract: Current Vision-Language-Action (VLA) models often optimize for semantic grounding, whereas executable manipulation requires geometry-aware spatial alignment and dynamic affordance selection. We introduce GeoAlign, a state-guided spatial alignment architecture for VLA policy learning. GeoAlign post-trains an RGB geometry branch with robot-domain RGB-D supervision, yielding RGB-derived Geometry-Enhanced Post-Trained (GEP) features for policy...

arXiv CS 7d ago

Goal2Pixel: Grounding Goals to Pixels for Vision-Language Navigation

arXiv:2606.01621v1 Announce Type: new Abstract: Vision-language models (VLMs) have become a common foundation for vision-and-language navigation in continuous environments (VLN-CE). Yet most VLM-based methods cast navigation as low-level action prediction, an interface that is ambiguous, tied to short-horizon motion primitives, and inefficient due to repeated VLM querying. We propose Goal2Pixel, a pure pixel-based paradigm that reformulates VLN-CE as navigable pixel grounding.

arXiv CS 8d ago

AffordanceVLA: A Vision-Language-Action Model Empowering Action Generation through Affordance-Aware Understanding

arXiv:2606.06155v1 Announce Type: new Abstract: Vision-Language-Action (VLA) models leverage the rich world knowledge of pretrained vision-language models (VLMs) to enable instruction-following robotic manipulation. However, the structural mismatch between VLM semantic spaces and embodied control policies often hinders the learning of precise perception-action mappings. To address this challenge, we propose AffordanceVLA, a unified framework that introduces structured affordance...

arXiv CS 5d ago

Revisiting Embodied Chain-of-Thought for Generalizable Robot Manipulation

arXiv:2606.03784v2 Announce Type: replace Abstract: Embodied chain-of-thought (CoT) aims to bridge linguistic reasoning and robotic control, but its effective form and integration strategy remain underexplored. In this paper, we revisit embodied CoT for vision-language-action (VLA) models at large scale. We construct the largest embodied CoT corpus to date, comprising 978,743 trajectories, 226.3M samples, and 2592.5 hours of robot data.

arXiv CS 6d ago

How climate shapes the meanings of words across languages

Lisa Lock, Scientific Editor; Robert Egan, Associate Editor. When English speakers say "rose" and Chinese speakers say "玫瑰," do they mean the same thing? A Peking University team led by Professor Bi Yanchao explored this question using word embeddings from 53 languages, behavioral ratings from speakers of eight languages and exploratory multilingual brain imaging data. Published in Nature Communications, the study shows that word meanings...

Phys.org 23h ago

Feat2Go: Visual Feature-Grounded Value Estimation for Embodied Reinforcement Learning

arXiv:2605.30795v1 Announce Type: new Abstract: Reinforcement learning is a promising approach for improving the capabilities of vision-language-action (VLA) models while avoiding the heavy data requirements of imitation learning. However, its effectiveness for VLA models is often constrained by sparse supervision and the difficulty of designing informative reward signals for long-horizon manipulation. In this work, we present Feat2Go, a fine-grained value estimation framework for embodied...

arXiv CS 9d ago