World Action Model

Related Articles from SNS

World Action Verifier: Self-Improving World Models via Forward-Inverse Asymmetry

arXiv:2604.01985v2. Announce Type: replace. Abstract: General-purpose world models promise scalable policy evaluation, optimization, and planning, yet achieving the required level of robustness remains challenging. Unlike policy learning, which primarily focuses on optimal actions, a world model needs to be reliable over a vast space of suboptimal actions, which are often underrepresented in action-labeled robot interactions. To address this challenge, we propose World Action Verifier (WAV), a...

arXiv CS 9d ago

Light-WAM: Efficient World Action Models with State-Fusion Action Decoding

arXiv:2606.08242v1. Announce Type: new. Abstract: World Action Models (WAMs) extend robot policy learning by incorporating future prediction as an additional training objective, encouraging the policy to encode task-relevant temporal structure in its representations. Current WAMs often rely on large-scale generative architectures that incur high training costs and inference latency, making them difficult to deploy as efficient closed-loop policies. We propose Light-WAM, a lightweight World...

arXiv CS 1d ago
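
The Light-WAM abstract describes future prediction as an additional training objective for the policy. One common generic way to write such an auxiliary objective is a weighted sum of an action loss and a prediction loss. The form below is an illustrative sketch under that assumption, not the loss used by Light-WAM, and every symbol in it is introduced here for illustration.

```latex
\mathcal{L}(\theta) =
  \underbrace{\mathbb{E}\left[\ell_{\mathrm{act}}\big(\pi_\theta(h_t),\, a_t\big)\right]}_{\text{action prediction}}
  + \lambda\,
  \underbrace{\mathbb{E}\left[\ell_{\mathrm{pred}}\big(f_\theta(h_t),\, o_{t+k}\big)\right]}_{\text{future prediction}},
  \qquad \lambda \ge 0
```

Here h_t is a representation shared by both heads, π_θ is the action head, f_θ is a prediction head, a_t is the demonstrated action, and o_{t+k} is an observation k steps ahead. With λ = 0 the objective reduces to plain behavior cloning; the second term is what pushes h_t to encode temporal structure, which is the effect the abstract attributes to the future-prediction objective.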

Flash-WAM: Modality-Aware Distillation for World Action Models

Announce Type: new. Abstract: World-action models (WAMs) jointly generate future video and robot actions through iterative diffusion, achieving strong performance on manipulation benchmarks but requiring tens of denoising steps, a cost that precludes real-time control. Step distillation has emerged as the natural remedy, but off-the-shelf methods break down in the joint video-action setting because video and action streams use different SNR-shifted noise schedules and reach training with...

arXiv CS 5d ago
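
The Flash-WAM abstract attributes the failure of off-the-shelf step distillation to the video and action streams using different SNR-shifted noise schedules. The sketch below shows numerically what such a mismatch means. It assumes a rectified-flow noising path x_t = (1 - t)·x0 + t·ε and the time-shift rule t' = s·t / (1 + (s - 1)·t) used by some flow-matching image generators; the shift values `VIDEO_SHIFT` and `ACTION_SHIFT` are hypothetical and are not Flash-WAM's settings.

```python
def shift_timestep(t: float, shift: float) -> float:
    """Map a flow-matching time t in (0, 1] to a shifted time.

    Assumed rule: t' = s * t / (1 + (s - 1) * t). Under the path in snr() it
    divides the signal-to-noise ratio by s**2, so shift > 1 means noisier.
    """
    return shift * t / (1.0 + (shift - 1.0) * t)


def snr(t: float) -> float:
    """SNR of x_t = (1 - t) * x0 + t * eps, assuming unit-variance x0 and eps."""
    return ((1.0 - t) / t) ** 2


# Hypothetical per-stream shift factors (assumptions, not Flash-WAM values).
VIDEO_SHIFT = 3.0
ACTION_SHIFT = 1.0


def stream_snrs(t: float) -> tuple[float, float]:
    """Return (video SNR, action SNR) at the same nominal timestep t."""
    video = snr(shift_timestep(t, VIDEO_SHIFT))
    action = snr(shift_timestep(t, ACTION_SHIFT))
    return video, action


if __name__ == "__main__":
    for t in (0.25, 0.5, 0.75):
        video, action = stream_snrs(t)
        print(f"t={t:.2f}  video SNR={video:.3f}  action SNR={action:.3f}")
```

Under these assumed settings, at t = 0.5 the action stream sits at SNR 1 while the video stream is already at SNR 1/9, so a few-step student that visits the same nominal timesteps for both streams sees them at very different noise levels. This only illustrates the mismatch the abstract names; it does not reproduce Flash-WAM's modality-aware remedy.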

RynnVLA-002: A Unified Vision-Language-Action and World Model

arXiv:2511.17502v3. Announce Type: replace. Abstract: We introduce RynnVLA-002, a unified Vision-Language-Action (VLA) and world model. The world model leverages action and visual inputs to predict future image states, learning the underlying physics of the environment to refine action generation. Conversely, the VLA model produces subsequent actions from image observations, enhancing visual understanding and supporting the world model's image generation.

arXiv CS 8d ago

Dream-Tac: A Unified Tactile World Action Model for Contact-Rich Robot Manipulation

arXiv:2606.08737v1. Announce Type: new. Abstract: World action models inherit the predictive capability of world models, enabling action generation to be guided by anticipated future observations. However, they rely primarily on vision and often fail in contact-rich manipulation, where critical cues arise from physical interaction. In this paper, we propose Dream-Tac, a unified Tactile-World Action Model that jointly models actions, future visual observations, and tactile dynamics.

arXiv CS 1d ago

World-Language-Action Model for Unified World Modeling, Language Reasoning, and Action Synthesis

arXiv:2606.05979v1. Announce Type: new. Abstract: We propose world-language-action (WLA) models as a new class of embodied foundation models. WLA takes textual instructions, images, and robot states as inputs to jointly predict textual subtasks, subgoal images, and robot actions, conjoining the world modeling interface to learn from extensive egocentric videos as in the world-action model (WAM) and the language reasoning capacities to solve complex long-horizon tasks as in...

arXiv CS 5d ago
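
The WLA abstract specifies an input/output contract: textual instructions, images, and robot states go in; textual subtasks, subgoal images, and robot actions come out. The sketch below writes that contract as Python types. Everything beyond those six named quantities is an assumption: the names `WLAInput`, `WLAOutput`, `WLAModel`, `EchoWLA`, and `run_step`, the field types, and the action-chunk layout are hypothetical and do not come from the paper.

```python
from dataclasses import dataclass
from typing import List, Protocol


@dataclass
class WLAInput:
    """Inputs named in the WLA abstract; field types are assumptions."""
    instruction: str            # textual instruction
    images: List[bytes]         # encoded camera images (assumed format)
    robot_state: List[float]    # proprioceptive state, e.g. joint positions (assumed)


@dataclass
class WLAOutput:
    """Outputs the abstract says are predicted jointly; field types are assumptions."""
    subtask: str                # textual subtask
    subgoal_image: bytes        # predicted subgoal image (assumed encoding)
    actions: List[List[float]]  # assumed action chunk: one vector per control step


class WLAModel(Protocol):
    """Hypothetical interface, not an API published with the paper."""

    def predict(self, x: WLAInput) -> WLAOutput: ...


class EchoWLA:
    """Toy stand-in that satisfies WLAModel; it returns placeholders, not predictions."""

    def __init__(self, action_dim: int, chunk: int) -> None:
        self.action_dim = action_dim
        self.chunk = chunk

    def predict(self, x: WLAInput) -> WLAOutput:
        return WLAOutput(
            subtask=f"first step of: {x.instruction}",
            subgoal_image=x.images[-1] if x.images else b"",
            actions=[[0.0] * self.action_dim for _ in range(self.chunk)],
        )


def run_step(model: WLAModel, x: WLAInput) -> List[float]:
    """Query the model once and return the first action of the predicted chunk."""
    return model.predict(x).actions[0]
```

A real WLA model predicts the three outputs jointly; `EchoWLA` only returns placeholders so that the types and the control-loop call can be exercised.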

AHA-WAM: Asynchronous Horizon-Adaptive World-Action Modeling with Observation-Guided Context Routing

arXiv:2606.09811v1. Announce Type: new. Abstract: World-action models have emerged as a promising paradigm for robot manipulation, jointly modeling visual scene dynamics and actions to inject physical priors into policy learning. However, existing world-action models couple world prediction and action execution at the same temporal resolution, forcing the world branch to model near-term frame variations that are redundant and weakly informative. We posit that strictly binding world prediction...

arXiv CS 1d ago
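
The coupling the AHA-WAM abstract criticizes can be made concrete by listing which future steps each branch is trained on. The sketch below is a hypothetical illustration: `coupled_schedule`, `decoupled_schedule`, and the fixed `frame_stride` are names invented here, and a fixed stride is a deliberate simplification, since the excerpt does not describe the paper's asynchronous, horizon-adaptive scheme.

```python
from typing import List, Tuple


def coupled_schedule(horizon: int) -> Tuple[List[int], List[int]]:
    """Coupled baseline: the world branch predicts a frame at every action step."""
    steps = list(range(1, horizon + 1))
    return steps, list(steps)  # (action steps, predicted-frame steps)


def decoupled_schedule(horizon: int, frame_stride: int) -> Tuple[List[int], List[int]]:
    """Hypothetical decoupling: act at every step, predict frames every `frame_stride` steps.

    A fixed stride is a simplification for illustration; it is not AHA-WAM's
    asynchronous, horizon-adaptive scheme.
    """
    if frame_stride < 1:
        raise ValueError("frame_stride must be >= 1")
    actions = list(range(1, horizon + 1))
    frames = list(range(frame_stride, horizon + 1, frame_stride))
    return actions, frames


if __name__ == "__main__":
    print(coupled_schedule(8))
    print(decoupled_schedule(8, frame_stride=4))
```

With a horizon of 8 and a stride of 4, the world branch predicts only steps 4 and 8 while the action branch still covers every step, so the world branch no longer spends capacity on the near-term frame variations the abstract calls redundant.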

Echo-Memory: A Controlled Study of Memory in Action World Models

arXiv:2606.09803v1. Announce Type: new. Abstract: We present Echo-Memory, a controlled study of memory mechanisms in action-conditioned world models. These models generate multi-segment videos from a first frame, text prompt, and camera-action sequence, but their central failure is often memory rather than local image synthesis: after the camera leaves and returns, the scene or salient object may silently change. Existing memory designs are hard to compare because gains are entangled...

arXiv CS 1d ago

OSCAR: Omni-Embodiment Skeleton-Conditioned World Action Model for Robotics

arXiv:2606.04463v1. Announce Type: new. Abstract: We present OSCAR, a precise action-conditioned video world model that generalizes across different robot embodiments and enables robot policy evaluation. Existing video world models face three main challenges for real-world robot evaluation: limited scenario diversity in current robot training datasets, imprecise action following, and poor generalization across embodiments for broad adoption. We tackle these challenges from two perspectives.

arXiv CS 6d ago

WorldFly: A World-Model-Based Vision-Language-Action Model for UAV Navigation

arXiv:2606.06147v1. Announce Type: new. Abstract: End-to-end Vision-Language-Action (VLA) models have shown promise in UAV navigation. However, existing approaches typically rely on historical observations to directly predict actions, often struggling in dense urban environments where severe occlusions and sharp turns result in drastic viewpoint transitions. We argue that the ability to "imagine" future states -- inherent in World Models -- is critical for robust decision-making under such...

arXiv CS 5d ago