Home Knowledge Base LVLMs

LVLMs

No mentions found

This entity hasn't been tracked yet, or Iris is still building its knowledge base.

Related Articles from SNS

Self-Improving Small Object Grounding in LVLMs

new Abstract: Can internal attention patterns in Large Vision Language Models (LVLMs) identify reliable small-object boxes without fine-tuning? In this work, we provide an affirmative answer. Attention structure in LVLMs encodes grounding quality-a lightweight IoU regressor trained solely on attention maps achieves strong IoU prediction (Pearson r > 0.67).

arXiv CS 8d ago

Disentangling Visual and Factual Correctness in LVLMs' Visualization Literacy

arXiv:2606.03142v1 Announce Type: new Abstract: Large Vision-Language Models (LVLMs) show strong visualization interpretation, yet it is unclear whether their responses reflect genuine reasoning over visual evidence or factual priors learned during training. Current evaluations mix these two sources, obscuring when correct visual interpretation is overridden by memorized facts. We present a framework that isolates visual correctness from factual correctness, revealing validity limitations in...

arXiv CS 7d ago

Improving Semantic Uncertainty Quantification in LVLMs with Semantic Gaussian Processes

arXiv:2512.14177v3 Announce Type: replace Abstract: Large Vision-Language Models (LVLMs) often produce plausible but unreliable outputs, making robust uncertainty estimation essential. Recent work on semantic uncertainty estimates relies on external models to cluster multiple sampled responses and measure their semantic consistency. However, these clustering methods are often fragile, highly sensitive to minor phrasing variations, and can incorrectly group or separate semantically similar...

arXiv CS 6d ago

What Makes LVLMs Hallucinate Less? Unveiling the Architectural Factors Behind Hallucination Robustness

arXiv:2605.30911v1 Announce Type: new Abstract: Hallucination remains one of the key challenges undermining the reliability of Large Vision-Language Models (LVLMs). But what makes an LVLM hallucinate less? Many existing efforts focus on improving internal components of the model.

arXiv CS 9d ago

Zero-Shot Learning in Industrial Scenarios: New Large-Scale Benchmark, Challenges and Baseline

Announce Type: new Abstract: Large Visual Language Models (LVLMs) have achieved remarkable success in vision tasks. However, the significant differences between industrial and natural scenes make applying LVLMs challenging. Existing LVLMs rely on user-provided prompts to segment objects.

arXiv CS 1d ago

Cross-Modal Attention Calibration for LVLM Hallucination Mitigation

arXiv:2501.01926v3 Announce Type: replace Abstract: Large vision-language models (LVLMs) have shown remarkable capabilities in visual-language understanding. Despite their success, LVLMs still suffer from generating hallucinations in complex generation tasks, leading to inconsistencies between visual inputs and generated content. To address this issue, some approaches have introduced inference-time interventions, such as contrastive decoding, to reduce overreliance on language priors.

arXiv CS 9d ago

MedSIGHT: Towards Grounded Visual Comprehension in Medical Large Vision-Language Models

arXiv:2606.06760v1 Announce Type: new Abstract: Medical large vision-language models (Med-LVLMs) have recently achieved remarkable progress in vision-language comprehension and medical image segmentation. However, existing models still struggle to unify these two capabilities, which is essential for achieving clinically reasoning that connects visual findings with semantic interpretation. We present MedSIGHT, a unified framework that equips Med-LVLMs with structured, pixel-level...

arXiv CS 2d ago

Teaching the Way, Not the Answer: Privileged Tutoring Distillation for Multimodal Policy Optimization

arXiv:2606.07000v1 Announce Type: new Abstract: Recent post-training methods, particularly Reinforcement Learning with Verifiable Rewards (RLVR), have significantly enhanced the reasoning ability of Large Vision-Language Models (LVLMs). However, the sparse nature of verifiable rewards provides little token-level supervision for failed rollouts, often leading to inefficient exploration in complex multimodal reasoning tasks.

arXiv CS 2d ago

YARD: Y-Architecture Register Decoding for Efficient Hallucination Mitigation in Large Vision-Language Models

arXiv:2605.31429v1 Announce Type: new Abstract: Contrastive decoding (CD) seeks to mitigate hallucinations in Large Vision-Language Models (LVLMs) by contrasting the output distributions of a standard model and a visually degraded model. However, existing training-free CD methods suffer from sub-optimal degraded branches: completely dropping visual tokens is too extreme and induces language hallucinations, while corrupting input images offers coarse control over visual evidence and suffers...

arXiv CS 9d ago

Topology-Aware Layer Pruning for Large Vision-Language Models

arXiv:2604.16502v2 Announce Type: replace Abstract: Large Language Models (LLMs) have demonstrated strong capabilities in natural language understanding and reasoning, while recent extensions that incorporate visual inputs enable them to process multimodal information. Despite these advances, Large Vision-Language Models (LVLMs) incur substantial computational and memory costs, hindering deployment in resource-constrained scenarios. Existing layer pruning methods typically rely on local...

arXiv CS 5d ago