Home › Knowledge Base › Vision-Language Foundation Models

Vision-Language Foundation Models

No mentions found

This entity hasn't been tracked yet, or Iris is still building its knowledge base.

Related Articles from SNS

EvoPrompt: Guided Prompt Evolution for Vision-Language Models Adaptation

arXiv:2603.09493v2 Announce Type: replace Abstract: The adaptation of large-scale vision-language models (VLMs) to downstream tasks with limited labeled data remains a significant challenge. While parameter-efficient prompt learning methods offer a promising path, they often suffer from catastrophic forgetting of pre-trained knowledge. Toward addressing this limitation, our work is grounded in the insight that governing the evolutionary path of prompts is essential for forgetting-free...

arXiv CS 6d ago

Multimodal Generative Engine Optimization: Rank Manipulation for Vision-Language Model Rankers

Announce Type: replace Abstract: Vision-Language Models (VLMs) integrate visual and textual knowledge into unified representations that increasingly underpin modern retrieval and recommendation systems. However, it remains unclear how reliably these models utilize their cross-modal knowledge when ranking multimodal items, and whether their knowledge grounding can be subverted. In this paper, we expose a fundamental vulnerability in how VLMs apply multimodal knowledge for product ranking:...

arXiv CS 1d ago

Thinking in Blender: Staged Executable Inverse Graphics with Vision-Language Models

arXiv:2606.02580v1 Announce Type: new Abstract: Inverse graphics is a longstanding and highly underconstrained problem that seeks to reconstruct images as editable 3D scenes which can be rendered, relit, and manipulated. In this work, we investigate whether pretrained vision-language models (VLMs) can perform executable inverse graphics directly from a single image by reconstructing a scene as an editable Blender program, without relying on specialized 2D or 3D foundation models,...

arXiv CS 8d ago

Does Language Shift Break Medical Vision-Language Models? Indonesian Radiology Visual Question Answering Case Study

arXiv:2606.03693v1 Announce Type: new Abstract: Medical Vision-Language Models (VLMs) are typically evaluated on English radiology visual question answering benchmarks, leaving their robustness under non-English clinical language largely unexplored. We introduce IndoRad-VQA, an Indonesian adaptation of VQA-RAD, to assess whether medical VLMs retain radiology reasoning ability when questions are asked in Bahasa Indonesia. Radiology question-answer pairs are translated into Indonesian with...

arXiv CS 7d ago

Almieyar-Oryx-BloomBench: A Bilingual Multimodal Benchmark for Cognitively Informed Evaluation of Vision-Language Models

arXiv:2606.05531v1 Announce Type: new Abstract: Despite the rapid progress of Vision-Language Models (VLMs), the field lacks benchmarks that rigorously diagnose their true reasoning abilities and chart meaningful progress toward human-like multimodal intelligence. Most existing evaluations focus on piecemeal or disconnected tasks, obscuring critical cognitive weaknesses and providing little insight for targeted improvement. To address this gap, we introduce BloomBench, part of the Almieyar...

arXiv CS 5d ago

Evi-Steer: Learning to Steer Biomedical Vision-Language Models through Efficient and Generalizable Evidential Tuning

Announce Type: replace Abstract: Parameter-efficient adaptation of vision-language foundation models is crucial for precise multimodal understanding of biomedical images, yet existing methods remain deterministic and often struggle under domain shift or ambiguous image-text alignment. This limitation is particularly critical in the clinic, where models should remain robust in low-data regimes and domain shifts.

arXiv CS 8d ago

Goal2Pixel: Grounding Goals to Pixels for Vision-Language Navigation

arXiv:2606.01621v1 Announce Type: new Abstract: Vision-language models (VLMs) have become a common foundation for vision-and-language navigation in continuous environments (VLN-CE). Yet most VLM-based methods cast navigation as low-level action prediction, an interface that is ambiguous, tied to short-horizon motion primitives, and inefficient due to repeated VLM querying. We propose Goal2Pixel, a pure pixel-based paradigm that reformulates VLN-CE as navigable pixel grounding.

arXiv CS 8d ago

KODA: Contrastive Representation Comparison and Alignment for Vision-Language Foundation Models

new Abstract: Vision-language foundation models such as CLIP and SigLIP provide widely used representations for multimodal learning systems. While these models are typically compared through downstream performance, such evaluations often do not explain how their representations differ structurally. In this work, we study this problem through the task of Contrastive Embedding Clustering: identifying sample subsets that are weakly clustered under one representation but strongly clustered under...

arXiv CS 6d ago

Seeing is Believing? Evaluating Vision-Language Model Susceptibility in Agent-to-Agent Multimodal Persuasion

arXiv:2510.22768v2 Announce Type: replace Abstract: As autonomous agents increasingly interact, they inevitably attempt to influence one another. While prior work in text-only settings has explored the dynamics of Agent-to-Agent (A2A) persuasion, the rise of Vision-Language Models (VLMs) introduces a more complex challenge: multimodal content conveys richer information while integrating subtle, hard-to-detect persuasive cues. To study this vulnerability, we present MMPersuade, a unified...

arXiv CS 5d ago

Textual Supervision Enhances Geospatial Representations in Vision-Language Models

Announce Type: new Abstract: Geospatial understanding is a critical yet underexplored dimension in the development of machine learning systems for tasks such as image geolocation and spatial reasoning. In this work, we analyze the geospatial representations acquired by three model families: vision-only architectures (e.g., ViT), vision-language models (e.g., CLIP), and large-scale multimodal foundation models (e.g., LLaVA, Qwen, and Gemma). By evaluating across image clusters, including...

arXiv CS 2d ago