Home Science World2Act: Latent Action Post-Training from World Model Dynamics
Science

World2Act: Latent Action Post-Training from World Model Dynamics

Key Points

arXiv:2603.10422v2 Announce Type: replace Abstract: World Models (WMs) offer a promising mechanism for post-training Vision-Language-Action (VLA) policies by providing dynamics priors that improve generalization under task and scene variation. However, most WM-based post-training methods rely on pixel-space supervision, making policies sensitive to visual artifacts introduced by imperfect WM rollouts. We present World2Act, a latent-space post-training framework that transfers WM dynamics to...

arXiv:2603.10422v2 Announce Type: replace Abstract: World Models (WMs) offer a promising mechanism for post-training Vision-Language-Action (VLA) policies by providing dynamics priors that improve generalization under task and scene variation. However, most WM-based post-training methods rely on pixel-space supervision, making policies sensitive to visual artifacts introduced by imperfect WM rollouts. We present World2Act, a latent-space post-training framework that transfers WM dynamics to the VLA policy without pixel-space supervision. World2Act operates in two stages: 1) it induces a shared video-action latent space by contrastively aligning WM-dynamics latents with action embeddings, and 2) it post-trains the VLA by guiding policy action representations toward WM-imagined dynamics rather than decoded pixels. Built on GR00T-N1.6, World2Act delivers absolute success-rate gains of up to +2.5% on simulation benchmarks (RoboCasa, LIBERO, Bridge-SIMPLER) and +6.7% on a real robot over finetuned VLA baselines. Notably, it outperforms pixel-space WM supervision by up to +6.0%, including on LIBERO where pixel supervision degrades the baseline, suggesting that latent WM dynamics offer a more stable WM-based post-training alternative to pixel-space transfer.
World Model Dynamics arXiv:2603.10422v2 (ORG) WM (ORG) VLA (ORG) World2Act (ORG) RoboCasa (ORG) LIBERO (ORG)
Originally published by arXiv CS Read original →