Towards Sparse Video Understanding and Reasoning

arXiv CS, Tuesday 02 June 2026, 04:00 UTC. By Chenwei Xu, Zhen Ye, Shang Wu, Weijian Li, Zihan Wang, Zhuofan Xia, Lie Lu, Pranav Maneriker, Fan Du, Manling Li, Han Liu

Key Points

arXiv:2602.13602v2. Abstract: We present ReViSe (Reasoning with Video Sparsity), a multi-round agent for video question answering (VQA). Instead of uniformly sampling frames, ReViSe selects a small set of informative frames, maintains a summary-as-state across rounds, and stops early when confident. It supports proprietary vision-language models (VLMs) in a "plug-and-play" setting and enables reinforcement fine-tuning for open-source models. For fine-tuning, we introduce EAGER (Evidence-Adjusted Gain for Efficient Reasoning), an annotation-free reward with three terms: (1) Confidence gain: after new frames are added, we reward the increase in the log-odds gap between the correct option and the strongest alternative; (2) Summary sufficiency: at answer time, we re-ask using only the last committed summary and reward success; (3) Correct-and-early stop: answering correctly within a small turn budget is rewarded. Across multiple VQA benchmarks, ReViSe improves accuracy while reducing frames, rounds, and prompt tokens, demonstrating practical sparse video reasoning.
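The three EAGER terms described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: the abstract does not give the exact formula, so the function names (log_odds_gap, eager_reward), the equal default weights, the default turn budget of 3, and the reading of "log-odds gap" as a difference of log-probabilities are all assumptions made for illustration. The sketch scores a single round of added frames and leaves the gain unclipped, so a round that lowers confidence contributes a negative value.

```python
import math
from typing import Sequence

EPS = 1e-12  # guards against log(0); an illustrative choice


def log_odds_gap(probs: Sequence[float], correct: int) -> float:
    """Log-probability of the correct option minus that of the strongest alternative.

    Assumption: this is one plausible reading of the abstract's "log-odds gap".
    """
    best_other = max(p for i, p in enumerate(probs) if i != correct)
    return math.log(probs[correct] + EPS) - math.log(best_other + EPS)


def eager_reward(
    probs_before: Sequence[float],
    probs_after: Sequence[float],
    correct: int,
    summary_only_correct: bool,
    answered_correct: bool,
    turns_used: int,
    turn_budget: int = 3,               # hypothetical budget
    weights: Sequence[float] = (1.0, 1.0, 1.0),  # hypothetical weights
) -> float:
    """Combine the three terms named in the abstract into one scalar."""
    # (1) Confidence gain: change in the gap after new frames are added.
    gain = log_odds_gap(probs_after, correct) - log_odds_gap(probs_before, correct)
    # (2) Summary sufficiency: did re-asking with only the summary succeed?
    sufficiency = 1.0 if summary_only_correct else 0.0
    # (3) Correct-and-early stop: correct answer within the turn budget.
    early = 1.0 if (answered_correct and turns_used <= turn_budget) else 0.0
    w_gain, w_suf, w_early = weights
    return w_gain * gain + w_suf * sufficiency + w_early * early


if __name__ == "__main__":
    reward = eager_reward(
        probs_before=[0.25, 0.25, 0.25, 0.25],
        probs_after=[0.70, 0.10, 0.10, 0.10],
        correct=0,
        summary_only_correct=True,
        answered_correct=True,
        turns_used=2,
    )
    print(f"reward = {reward:.4f}")
```

In this example the model starts from a uniform distribution over four options (gap of zero) and moves to 0.70 on the correct option, so the gain term is roughly log(7); the other two terms each contribute 1 under the assumed weights.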

Originally published by arXiv CS.

