Vision On-Policy Distillation

Related Articles from SNS

VOLD: Reasoning Transfer from LLMs to Vision-Language Models via On-Policy Distillation

arXiv:2510.23497v3 Announce Type: replace Abstract: Training vision-language models (VLMs) for complex reasoning remains a challenging task, in part due to the scarcity of high-quality image-text reasoning data. Conversely, text-based reasoning resources are abundant and scalable, but it is still an open question how to leverage them for VLM reasoning. To address this problem, we propose VOLD, a framework to transfer reasoning capabilities from text-only teacher models to VLM student models.

arXiv CS 5d ago

Vision-OPD: Learning to See Fine Details for Multimodal LLMs via On-Policy Self-Distillation

arXiv:2605.18740v4 Announce Type: replace Abstract: Multimodal Large Language Models (MLLMs) still struggle with fine-grained visual understanding, where answers often depend on small but decisive evidence in the full image. We observe a regional-to-global perception gap: the same MLLM answers fine-grained questions more accurately when conditioned on evidence-centered crops than on the corresponding full images, suggesting that many failures stem from difficulty to focus on relevant...

arXiv CS 7d ago

Thinking Without Images: Internalizing Visual Manipulation with On-Policy Self-Distillation

arXiv:2606.08719v1 Announce Type: new Abstract: "Thinking with Images" has emerged as an effective paradigm for fine-grained visual reasoning: by explicitly zooming into relevant regions and reasoning over crops, models can access local evidence that is difficult to recover from a single global image. However, this benefit comes with redundant tool invocations and longer inference traces. Moreover, when such behaviors are learned mainly from outcome reward, the resulting intermediate crops...

arXiv CS 1d ago

Stage-1 Controls the Entropy Regime, Not the Outcome

arXiv:2606.09059v1 Announce Type: new Abstract: Two-stage post-training -- a Stage-1 warm-start (supervised fine-tuning, SFT, or on-policy distillation, OPD) followed by Stage-2 reinforcement learning (RL) -- is increasingly used for vision-language models (VLMs). We ask what Stage-1 actually controls in a small-data study using Qwen2.5-VL-7B with a same-modality 72B VLM teacher for OPD. First, the three warm-starts reach a narrow 53-54% band on Geometry3K internal validation,...

arXiv CS 1d ago

Learning Visual Spatial Planning from Symbolic State via Modality-Gap-Aware Self-Distillation

Announce Type: new Abstract: While vision-language models excel at general multimodal understanding, they still struggle with visual spatial planning. We attribute this to a perception-reasoning modality gap: visual planning requires models to infer latent state structures from pixels and then reason over the recovered structure to produce valid actions, whereas symbolic planning directly leverages explicit objects and constraints. This creates dual bottlenecks in visual state recovery and...

arXiv CS 5d ago

Launch HN: General Instinct (YC P26) – Frontier models on edge devices

Hey HN, Guanming and Bill here from General Instinct (https://general-instinct.com/). After years of working in robotics, we kept running into the same problem: the best models never fit the hardware we actually had available. The models that performed best were usually designed around datacenter assumptions: large GPUs, lots of memory bandwidth, and reliable network access. But most physical systems have the opposite constraints.

Hacker News 5d ago