ViVa: A Video-Generative Value Model for Robot Reinforcement Learning

arXiv CS, Monday 08 June 2026, 04:00 UTC. By Jindi Lv, Hao Li, Jie Li, Fankun Kong, Yang Wang, Pengfei Yi, Yifei Nie, Xiaofeng Wang, Zheng Zhu, Chaojun Ni, Qiuping Deng, Hengtao Li, Jiancheng Lv, Guan Huang

Key Points

arXiv:2604.08168v2. Abstract: Vision-language-action (VLA) models have advanced robot manipulation through large-scale pretraining, but real-world deployment remains challenging due to partial observability and delayed feedback. Reinforcement learning addresses this via value functions, which assess task progress and guide policy improvement. However, existing value models built on vision-language models (VLMs) struggle to capture temporal dynamics and physical interactions, undermining reliable value estimation in long-horizon tasks. In this paper, we propose ViVa, a video-generative value model that repurposes a pretrained video generator to jointly predict future proprioception and a scalar value. By grounding value estimation in anticipated embodiment dynamics, ViVa leverages spatiotemporal priors to intrinsically couple value with foresight beyond static snapshots. ViVa achieves state-of-the-art results in metric-based evaluation across three tasks, producing reliable value signals that accurately track task progress and detect execution errors. Integrated into RECAP, it achieves an average success rate of 80%, highlighting the promise of video-generative models for value estimation.
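The abstract says the value signal can track task progress and detect execution errors, but it does not say how errors are read off that signal. As a rough illustration only, and not the paper's method, one simple heuristic is to flag timesteps where the predicted value falls sharply from one step to the next. The function name `flag_value_drops`, the `drop_threshold` parameter, and the sample values are all hypothetical.

```python
from typing import List


def flag_value_drops(values: List[float], drop_threshold: float = 0.2) -> List[int]:
    """Return indices t where values[t] is lower than values[t-1] by more than drop_threshold.

    Illustrative heuristic only; the source does not describe ViVa's error-detection rule.
    """
    flagged = []
    for t in range(1, len(values)):
        # A large step-to-step decrease suggests task progress was lost.
        if values[t - 1] - values[t] > drop_threshold:
            flagged.append(t)
    return flagged


# Hypothetical per-step value estimates in [0, 1], made up for illustration.
example_values = [0.10, 0.25, 0.40, 0.55, 0.20, 0.30, 0.50]
print(flag_value_drops(example_values))  # prints [4]
```

In this made-up trace the value rises steadily, drops at step 4, and then recovers, so only that step is flagged. A real system would likely need smoothing or a task-specific threshold.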


Originally published by arXiv CS.

