Home › Knowledge Base › the Unified World Model

the Unified World Model

No mentions found

This entity hasn't been tracked yet, or Iris is still building its knowledge base.

Related Articles from SNS

World-Language-Action Model for Unified World Modeling, Language Reasoning, and Action Synthesis

arXiv:2606.05979v1 Announce Type: new Abstract: We propose world-language-action (WLA) models as a new class of embodied foundation models. WLA takes textual instructions, images, and robot states as inputs to jointly predict textual subtasks, subgoal images, and robot actions, conjoining the \emph{world modeling interface} to learn from extensive egocentric videos as in the world-action model (WAM) and the \emph{language reasoning} capacities to solve complex long-horizon tasks as in...

arXiv CS 5d ago

Dream-Tac: A Unified Tactile World Action Model for Contact-Rich Robot Manipulation

arXiv:2606.08737v1 Announce Type: new Abstract: World action models inherit the predictive capability of world models, enabling action generation to be guided by anticipated future observations. However, they rely primarily on vision and often fail in contact-rich manipulation, where critical cues arise from physical interaction. In this paper, we propose Dream-Tac, a unified Tactile-World Action Model that jointly models actions, future visual observations, and tactile dynamics.

arXiv CS 1d ago

RynnVLA-002: A Unified Vision-Language-Action and World Model

arXiv:2511.17502v3 Announce Type: replace Abstract: We introduce RynnVLA-002, a unified Vision-Language-Action (VLA) and world model. The world model leverages action and visual inputs to predict future image states, learning the underlying physics of the environment to refine action generation. Conversely, the VLA model produces subsequent actions from image observations, enhancing visual understanding and supporting the world model's image generation.

arXiv CS 8d ago

WAM-Nav: Asymmetric Latent World-Action Modeling for Unified Visual Navigation

arXiv:2606.04907v1 Announce Type: new Abstract: Visual navigation requires generating smooth and collision-free trajectories under complex geometric and physical constraints. Existing reactive policies that directly map observations to actions lack anticipatory reasoning, limiting their ability to proactively avoid obstacles. While visual imagination offers predictive foresight, conventional modular approaches separate scene prediction from policy learning, often leading to error...

arXiv CS 6d ago

Unifying Object-Centric World Models and Diffusion Policy: A Hierarchical Framework for Multi-Stage Robotic Tasks

arXiv:2606.08775v1 Announce Type: new Abstract: Visual world models have shown great potential in learning complex system dynamics. Recent advancements leverage these models as transition functions within Model Predictive Control (MPC) frameworks to solve various control tasks. When applied to robotics, however, they are limited to single-stage tasks such as reaching or grasping, and struggle with multi-stage ones that demand complex sequential planning.

arXiv CS 1d ago

WorldLens: Full-Spectrum Evaluations of Driving World Models in Real World

arXiv:2512.10958v2 Announce Type: replace Abstract: Generative world models are reshaping embodied AI, enabling agents to synthesize realistic 4D driving environments that look convincing but often fail physically or behaviorally. Despite rapid progress, the field still lacks a unified way to assess whether generated worlds preserve geometry, obey physics, or support reliable control. We introduce WorldLens, a full-spectrum benchmark evaluating how well a model builds, understands, and...

arXiv CS 8d ago

WorldCache: Accelerating World Models for Free via Heterogeneous Token Caching

arXiv:2603.06331v2 Announce Type: replace Abstract: Diffusion-based world models have shown strong potential for unified world simulation, but the iterative denoising remains too costly for interactive use and long-horizon rollouts. While feature caching can accelerate inference without training, we find that policies designed for single-modal diffusion transfer poorly to world models due to two world-model-specific obstacles: \emph{token heterogeneity} from multi-modal coupling and spatial...

arXiv CS 8d ago

Attend to Anything: Foundation Model for Unified Human Attention Modeling

arXiv:2606.03540v1 Announce Type: new Abstract: Existing human attention (saliency) modeling methods persist as highly fragmented across modalities, scenes, and task formulations. Consequently, even with increasing model capacity and data scale, current models predominantly remain scene-dependent and task-specific, failing to practically generalize in real-world applications. To address the fundamental limitations, we present the Attend to Anything Model (AAM), a multi-modal foundation model...

arXiv CS 7d ago

Cosmos 3: Omnimodal World Models for Physical AI

arXiv:2606.02800v2 Announce Type: replace Abstract: We introduce Cosmos 3, a family of omnimodal world models designed to jointly process and generate language, image, video, audio, and action sequences within a unified mixture-of-transformers architecture. By supporting highly flexible input-output configurations, Cosmos 3 seamlessly unifies critical modalities for Physical AI -- effectively subsuming vision-language models, video generators, world simulators, and world-action models into a...

arXiv CS 1d ago

Cosmos 3: Omnimodal World Models for Physical AI

arXiv:2606.02800v1 Announce Type: new Abstract: We introduce Cosmos 3, a family of omnimodal world models designed to jointly process and generate language, image, video, audio, and action sequences within a unified mixture-of-transformers architecture. By supporting highly flexible input-output configurations, Cosmos 3 seamlessly unifies critical modalities for Physical AI -- effectively subsuming vision-language models, video generators, world simulators, and world-action models into a...

arXiv CS 7d ago