Home › Knowledge Base › Learning Visual Spatial Planning

Learning Visual Spatial Planning

No mentions found

This entity hasn't been tracked yet, or Iris is still building its knowledge base.

Related Articles from SNS

Learning Visual Spatial Planning from Symbolic State via Modality-Gap-Aware Self-Distillation

Announce Type: new Abstract: While vision-language models excel at general multimodal understanding, they still struggle with visual spatial planning. We attribute this to a perception-reasoning modality gap: visual planning requires models to infer latent state structures from pixels and then reason over the recovered structure to produce valid actions, whereas symbolic planning directly leverages explicit objects and constraints. This creates dual bottlenecks in visual state recovery and...

arXiv CS 5d ago

Learning Visual Spatial Planning from Symbolic State via Modality-Gap-Aware Self-Distillation

Announce Type: replace Abstract: While vision-language models excel at general multimodal understanding, they still struggle with visual spatial planning. We attribute this to a perception-reasoning modality gap: visual planning requires models to infer latent state structures from pixels and then reason over the recovered structure to produce valid actions, whereas symbolic planning directly leverages explicit objects and constraints. This creates dual bottlenecks in visual state recovery...

arXiv CS 1d ago

When and How Much to Imagine: Adaptive Test-Time Scaling with World Models for Visual Spatial Reasoning

arXiv:2602.08236v2 Announce Type: replace Abstract: Despite rapid progress in MLLMs, visual spatial reasoning remains unreliable when correct answers depend on how a scene would appear under unseen or alternative viewpoints. Recent work addresses this by augmenting reasoning with world models for visual imagination, but questions such as when imagination is actually necessary, how much of it is beneficial, and when it becomes harmful, remain poorly understood.

arXiv CS 8d ago

Design-MLLM: A Reinforcement Alignment Framework for Verifiable and Aesthetic Interior Design

Announce Type: replace Abstract: Interior design is a requirements-to-visual-plan generation process that must simultaneously satisfy verifiable spatial feasibility and comparative aesthetic preferences. While recent multimodal large language models (MLLMs) offer a unified foundation for interpreting user intent and producing design rationales, our empirical analysis reveals a persistent contradiction in real-world deployment: MLLMs often produce layouts that are unbuildable and...

arXiv CS 8d ago

Audio-Visual World Models: Grounding Multisensory Imagination for Embodied Agents

arXiv:2512.00883v3 Announce Type: replace Abstract: World models simulate environmental dynamics to enable agents to plan and reason about future states. While existing approaches have primarily focused on visual observations, real-world perception inherently involves multiple sensory modalities.

arXiv CS 2d ago

Amplified Arctic iceberg traffic reshapes benthic biodiversity

Abstract The Arctic is undergoing rapid warming, resulting in retreating sea ice and glaciers1, yet how cryospheric changes propagate into the deep ocean remains poorly understood2. Here we identify a climate-driven mechanism linking accelerating glacier disintegration to an increase in deep-sea hard-bottom habitats far beyond calving fronts. Seafloor observations in Fram Strait show a localized increase in the density and patchiness of dropstones delivered by debris-laden icebergs.

Nature 20h ago

Embody4D: A Generalist Data Engine for Embodied 4D World Modeling

arXiv:2605.01799v2 Announce Type: replace Abstract: Embodied agents require robust and comprehensive 3D spatiotemporal representations to support spatial reasoning, manipulation understanding, and downstream decision making. However, existing robot data are typically captured from fixed or sparse viewpoints, providing only partial and view-dependent observations, which limits multi-view perception and generalization across viewpoints. Given the difficulty of collecting additional viewpoints...

arXiv CS 1d ago

Hierarchical Semantic-Augmented Navigation: Optimal Transport and Graph-Driven Reasoning for Vision-Language Navigation

Announce Type: new Abstract: Vision-Language Navigation in Continuous Environments (VLN-CE) poses a formidable challenge for autonomous agents, requiring seamless integration of natural language instructions and visual observations to navigate complex 3D indoor spaces. Existing approaches often falter in long-horizon tasks due to limited scene understanding, inefficient planning, and lack of robust decision-making frameworks. We introduce the \textbf{Hierarchical Semantic-Augmented...

arXiv CS 8d ago

Don't Fool Me Twice: Adapting to Adversity in the Wild with Experience-Driven Reasoning

arXiv:2605.31119v1 Announce Type: new Abstract: In robotics, dangers and adversity modes are often embodiment-specific and relative to each agent. A frontier of autonomous mobile robotics is to enable agents to operate effectively in the wild in unseen unstructured environments. A significant challenge in unseen unstructured environments is that it may not be possible to predict all the dangers to the specific robot.

arXiv CS 9d ago

A thalamus–brainstem attractor network drives history-biased decisions

Abstract Natural environments often change gradually, making it adaptive to bias decisions on the basis of the recent past — a phenomenon known as serial dependence1,2,3. Large-scale recordings during behaviour have identified that serial dependence is a common motif for decision-making, with neural representations of past experiences found throughout the brain4,5,6,7,8,9,10,11. However, it remains unclear whether this bias arises from dedicated neural circuits with history-specific...

Nature 20h ago