Visual Foundation Model

Related Articles from SNS

VGGSounder: Audio-Visual Evaluations for Foundation Models

Abstract: The emergence of audio-visual foundation models underscores the importance of reliably assessing their multi-modal understanding. The VGGSound dataset is commonly used as a benchmark for evaluating audio-visual classification. However, our analysis identifies several limitations of VGGSound, including incomplete labelling, partially overlapping classes, and misaligned modalities.

arXiv CS 6d ago

Attention Consistent Longitudinal Medical Visual Question Answering Guided by Vision Foundation Models

Abstract: Longitudinal medical visual question answering (VQA) requires reasoning about anatomical differences between an image of a current time point and an image of a referred time point. We propose an attention-guided encoder-decoder for this task with chest X-rays. Instead of conventional direct contrast, we propose to include a lightweight affine registration module to reduce nuisance motion by co-registering the current image to the reference image with a small...

arXiv CS 2d ago

Geometry-Preserving Unsupervised Alignment for Heterogeneous Foundation Models

Abstract: Foundation models have driven rapid progress in computer vision, yet the two dominant paradigms, vision-language foundation models (VLMs) and vision-only foundation models (VFMs), remain only partially compatible. VLMs offer language-grounded semantic alignment but are often visually coarse, while VFMs learn discriminative perceptual geometry but lack semantic grounding. We propose GPUA (Geometry-Preserving Unsupervised Alignment), a framework that integrates the...

arXiv CS 6d ago

Attend to Anything: Foundation Model for Unified Human Attention Modeling

arXiv:2606.03540v1 Abstract: Existing human attention (saliency) modeling methods persist as highly fragmented across modalities, scenes, and task formulations. Consequently, even with increasing model capacity and data scale, current models predominantly remain scene-dependent and task-specific, failing to practically generalize in real-world applications. To address the fundamental limitations, we present the Attend to Anything Model (AAM), a multi-modal foundation model...

arXiv CS 7d ago

Enhancing the Socioeconomic Understanding of Foundation Models with Urban Mobility

Abstract: Foundation models have recently been applied to urban socioeconomic prediction using POI text, satellite imagery, and geospatial descriptions. However, these models mostly rely on static attributes of individual places, while ignoring the mobility patterns that reveal how places are functionally connected. To address this gap, we explore whether mobility networks can elicit the geospatial capabilities of foundation models by explicitly encoding connectivity among...

arXiv CS 8d ago

AlloSpatial: Agentic Harness Framework for Spatial Reasoning in Foundation Models

arXiv:2606.08952v1 Abstract: Multimodal Foundation Models (MFMs) have made substantial progress, yet remain fragile in spatial reasoning over the physical world. A key bottleneck lies in their inability to transform local egocentric observations into a global allocentric spatial representation. To address this, we propose AlloSpatial, an agentic framework for allocentric spatial cognition in foundation models.

arXiv CS 1d ago

NewtPhys: Do Foundation Models Understand Newtonian Physics?

arXiv:2606.03986v1 Abstract: Previous work has evaluated physics reasoning in foundation models using synthetic or semi-synthetic scenes and visual question-answering tasks. However, these benchmarks emphasize high-level events and lack the visual fidelity required to assess true low-level Newtonian understanding.

arXiv CS 7d ago

PRISM: Synergizing Vision Foundation Models via Self-organized Expert Specialization

Abstract: Unifying the complementary strengths of diverse Vision Foundation Models (VFMs) into a single efficient model is highly desirable but challenged by the negative transfer inherent in monolithic distillation. To address these feature conflicts, we introduce PRISM, a novel dual-stream Mixture-of-Experts (MoE) framework that synergizes VFMs via modular specialization.

arXiv CS 7d ago

AFUN: Towards an Affordance Foundation Model for Functionality Understanding

Abstract: Affordance understanding bridges visual perception and physical action, serving as an explainable interface for robot manipulation in open and unstructured real-world environments. Yet, building an affordance foundation model that not only understands where and how the interaction should happen, but also generalizes across diverse environments, objects, and tasks, remains a long-standing research challenge.

arXiv CS 8d ago

Where to Look: Can Foundation Models Reach a Target Viewpoint Through Active Exploration?

Abstract: Humans can reproduce the viewpoint specified by a target image through active head and body motion, yet spatial intelligence in foundation models has largely been studied as passive understanding of pre-collected observations. We introduce Target Viewpoint Reproduction (TVR) -- an active task where an agent adjusts its viewpoint in a 3D environment until its observation matches a given target image -- and TVRBench, an indoor-simulation benchmark spanning scene...

arXiv CS 8d ago