When Video Misreads: Closed-Loop Distillation of Reading Heuristics for Exploratory Manipulation Trace QA

arXiv CS, Tuesday 09 June 2026, 04:00 UTC. By Haizhou Ge, Yufei Jia, Yue Li, Zhixing Chen, Lu Shi, Lei Han, Guyue Zhou, Ruqi Huang

arXiv:2606.08542v1

Abstract: Exploratory manipulation often turns an apparent failed attempt into the key evidence for what to do next. For example, a robot pulls a locked cabinet drawer, fails, and only succeeds after opening the lock. The failed pull reveals a latent precondition (the drawer is locked) that determines the minimal-success action chain (the fewest actions that complete the task), here [lock-open, drawer-pull]. Correctly reading this trace is therefore the prerequisite for recovering that chain. We formalize this setting as Exploratory Manipulation Trace QA (EMT-QA): given synchronized video and proprioception from an exploratory trace, predict the minimal-success action chain under the latent precondition revealed by the probe. However, even state-of-the-art VLMs and embodied multimodal LLMs misread this evidence: they do not reliably recover the chain from raw video, raw proprioception, or their combination. We introduce Closed-Loop Trace Distillation, a pipeline that uses a per-task coding agent to inspect labeled training traces and distill a one-line natural-language prompt over the trace, which we call the Distilled Reading Heuristic (DRH). At inference, no agent is invoked and no model weights are updated; a frozen VLM receives the raw trace plus the DRH as a prompt entry. Across three simulator and two real-robot tasks, the DRH improves chain accuracy by +0.38 to +0.47 over the best raw-modality baseline. The same DRH also serves as the sole specification for one-shot programmatic classifiers that match the prompted VLM.
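
To make the last two points of the abstract concrete, the following is a minimal sketch of how a one-line heuristic could drive both a prompt entry and a programmatic classifier for the locked-drawer example. Everything specific here is an assumption for illustration: the heuristic wording in EXAMPLE_DRH, the TraceStep fields, the thresholds, and the function names are hypothetical and are not taken from the paper, whose actual DRH text and classifier code are not given in the abstract.

```python
from dataclasses import dataclass
from typing import List, Sequence

# Hypothetical one-line heuristic, invented for this sketch. The paper's
# actual DRH text does not appear in the abstract.
EXAMPLE_DRH = (
    "If pull force rises while the drawer barely moves, the drawer is locked: "
    "answer [lock-open, drawer-pull]; otherwise answer [drawer-pull]."
)


@dataclass
class TraceStep:
    """One proprioceptive sample; both fields are assumed signal names."""
    pull_force_n: float           # force along the pull direction, in newtons
    drawer_displacement_m: float  # drawer travel from its closed pose, in metres


def predict_chain(
    trace: Sequence[TraceStep],
    force_threshold_n: float = 10.0,    # assumed threshold, not from the paper
    motion_threshold_m: float = 0.01,   # assumed threshold, not from the paper
) -> List[str]:
    """Apply the hypothetical heuristic as a rule over the trace.

    A step with high pull force and near-zero displacement is read as a
    stalled pull, i.e. evidence of the latent 'locked' precondition.
    """
    stalled = any(
        step.pull_force_n >= force_threshold_n
        and step.drawer_displacement_m < motion_threshold_m
        for step in trace
    )
    return ["lock-open", "drawer-pull"] if stalled else ["drawer-pull"]


def build_prompt(drh: str, question: str) -> str:
    """Compose the text part of a prompt: the heuristic plus the question.

    The raw video and proprioception would be attached separately by
    whatever VLM interface is used; that part is outside this sketch.
    """
    return f"Reading heuristic: {drh}\nQuestion: {question}"
```

For instance, a trace containing a sample with 15 N of pull force and 1 mm of drawer travel would be classified as [lock-open, drawer-pull] under these assumed thresholds, while a trace in which the drawer slides open under low force would yield [drawer-pull]. The rule-based form is only one possible reading of "programmatic classifier"; the abstract does not say how the paper's classifiers are structured.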

Originally published by arXiv CS.
