Home › Knowledge Base › Vision-Language Model Selection

Vision-Language Model Selection

No mentions found

This entity hasn't been tracked yet, or Iris is still building its knowledge base.

Related Articles from SNS

TEVI: Text-Conditioned Editing of Visual Representations via Sparse Autoencoders for Improved Vision-Language Alignment

arXiv:2606.07451v1 Announce Type: new Abstract: Vision-language models such as CLIP are highly useful for diverse tasks due to their shared image-text embedding space. Despite this, the image and text embeddings are often poorly aligned, affecting downstream performance. Recent work has shown that this can be attributed to an information imbalance: images contain more information than their captions describe.

arXiv CS 2d ago

An Effective Router for Vision-Language Model Selection

Announce Type: new Abstract: Vision-language models (VLMs) with varying performance and resource requirements are widely deployed, making it difficult for users to select the most appropriate one among numerous VLM candidates. Existing work reveals the performance paradox phenomenon in language models and focuses on routing methods to solve it. However, developing a router for VLM selection is still a critical yet challenging problem, which primarily faces: 1) lack of specialized data, 2)...

arXiv CS 1d ago

VeriDrive: Verifiable Counterfactual Supervision for Cost-Efficient Vision-Language Planning

arXiv:2606.07338v1 Announce Type: new Abstract: Vision-language driving models increasingly use reasoning supervision to bridge perception, prediction, and planning, but existing driving rationales are often free-form and expensive to generate with frontier models. We present VeriDrive, a framework for constructing planning-oriented, verifiable counterfactual supervision. VeriDrive converts driving reasoning into a structured Perception-Evaluation-Revision chain that grounds key objects in...

arXiv CS 2d ago

Stateful Visual Encoders for Vision-Language Models

arXiv:2606.04433v1 Announce Type: new Abstract: Vision-language models (VLMs) are increasingly used in multi-image, multi-turn agentic settings where decisions depend on visual changes. However, in existing open-weight VLMs, visual comparisons happen only inside the language model, while the visual encoder itself remains stateless: each image is encoded independently, without access to the prior visual context. As a result, small but task-critical changes may be attenuated before the...

arXiv CS 6d ago

Hierarchical Semantic-Augmented Navigation: Optimal Transport and Graph-Driven Reasoning for Vision-Language Navigation

Announce Type: new Abstract: Vision-Language Navigation in Continuous Environments (VLN-CE) poses a formidable challenge for autonomous agents, requiring seamless integration of natural language instructions and visual observations to navigate complex 3D indoor spaces. Existing approaches often falter in long-horizon tasks due to limited scene understanding, inefficient planning, and lack of robust decision-making frameworks. We introduce the \textbf{Hierarchical Semantic-Augmented...

arXiv CS 8d ago

MAOAM: Unified Object and Material Selection with Vision-Language Models

arXiv:2606.04880v1 Announce Type: new Abstract: Selection is a core operation in interactive image editing. To be practical, a user should be able to specify and disambiguate the desired selection region through either text or click-based interactions, and the system should support selecting not only objects but also other criteria, such as materials. Material-based selection is valuable for tasks like re-texturing surfaces or editing instances of a specific material.

arXiv CS 6d ago

Diversity Matters: Revisiting Test-Time Compute in Vision-Language Models

Announce Type: new Abstract: Test-time compute (TTC) strategies have emerged as a lightweight approach to boost reasoning in large language models (LLMs). However, their application and benefits for vision-language models (VLMs) remain underexplored. We present a systematic study of TTC across seven VLMs and six benchmarks, specifically analyzing feature-based scoring and majority voting methods.

arXiv CS 9d ago

EasyLens: A Training-Free Plug-and-Play Subtle-Lesion Representation Amplifier for Medical Vision-Language Models

arXiv:2606.06379v1 Announce Type: new Abstract: Medical vision-language models (VLMs) have shown increasing potential for clinical image interpretation, including lesion detection and report generation. However, their practical utility remains limited by insufficient sensitivity to subtle lesions, whose visual evidence is often sparse, low-contrast, and embedded within complex anatomical context. As local visual tokens are aggregated, these weak lesion cues can become underrepresented in...

arXiv CS 5d ago

Interpretable Modeling of Driver Attention Shifts with a Vision-Language Model

Announce Type: replace Abstract: Driver gaze is commonly modeled as a spatial heatmap, but heatmaps alone are difficult for humans to interpret because they do not explain which road object or region is being monitored or why an attention shift may matter. This study examines whether minimal human-grounded supervision can steer a vision--language model toward interpretable descriptions of driver attention shifts. Using selected high-change gaze moments from the Berkeley DeepDrive-Attention...

arXiv CS 7d ago

DriveReward: A Comprehensive Dataset and Generative Vision-Language Reward Model for Autonomous Driving

Announce Type: new Abstract: Reward models play a pivotal role in reinforcement learning (RL) and multi-modal trajectory selection for autonomous driving. However, acquiring such rewards typically relies on hand-crafted rule-based objectives or perception ground truth, which hinders generalization for data-scaling. While Vision-Language Models (VLMs) have demonstrated feasibility as reward models in other domains, their effectiveness in driving tasks remains underexplored.

arXiv CS 1d ago