Home Knowledge Base Fine-Grained Visual

Fine-Grained Visual

No mentions found

This entity hasn't been tracked yet, or Iris is still building its knowledge base.

Related Articles from SNS

MoDA: Modulation Adapter for Fine-Grained Visual Grounding in Instructional MLLMs

arXiv:2506.01850v2 Announce Type: replace Abstract: Multimodal Large Language Models (MLLMs) have achieved remarkable success in instruction-following tasks by integrating pretrained visual encoders with large language models (LLMs). However, existing approaches often struggle with fine-grained visual grounding due to semantic entanglement in visual patch representations, where individual patches blend multiple distinct visual elements, making it difficult for models to focus on...

arXiv CS 2d ago

HOLO: Homography-Guided Pose Estimator Network for Fine-Grained Visual Localization on SD Maps

arXiv:2601.02730v3 Announce Type: replace Abstract: Visual localization on standard-definition (SD) maps has emerged as a promising low-cost and scalable solution for autonomous driving. However, existing regression-based approaches often overlook inherent geometric priors, resulting in suboptimal training efficiency and limited localization accuracy. In this paper, we propose a novel homography-guided pose estimator network for fine-grained visual localization between multi-view images and...

arXiv CS 5d ago

Learning from Fine-Grained Visual Discrepancies: Mitigating Multimodal Hallucinations via In-Context Visual Contrastive Optimization

arXiv:2605.31312v1 Announce Type: new Abstract: Multimodal hallucination remains a persistent challenge for Vision-Language Models (VLMs). Standard textual Direct Preference Optimization (DPO) often fails to mitigate it due to a lack of explicit visual supervision. While existing works introduce visual preference DPO by contrasting original images against negative ones, they suffer from a theoretically inconsistent objective caused by partition function mismatches and rely on coarse-grained...

arXiv CS 9d ago

LookWise: Knowing When and Where to Look for Fine-Grained Visual Reasoning in Multimodal Large Language Models

arXiv:2603.00171v3 Announce Type: replace Abstract: Multimodal Large Language Models (MLLMs) are shifting towards "Thinking with Images" by actively exploring image details. While effective, large-scale training is computationally expensive, which has spurred growing interest in lightweight, training-free solutions. However, existing training-free methods suffer from two flaws: perceptual redundancy from indiscriminate cropping, which increases computational cost and introduces noise; and a...

arXiv CS 8d ago

PAND: Prompt-Aware Neighborhood Distillation for Lightweight Fine-Grained Visual Classification

arXiv:2602.07768v3 Announce Type: replace Abstract: Distilling knowledge from large Vision-Language Models (VLMs) into lightweight networks is crucial yet challenging in Fine-Grained Visual Classification (FGVC), due to the reliance on fixed prompts and global alignment. To address this, we propose PAND (Prompt-Aware Neighborhood Distillation), a two-stage framework that decouples semantic calibration from structural transfer.

arXiv CS 7d ago

EvoMemNav: Efficient Self-Evolving Fine-Grained Memory for Zero-Shot Embodied Navigation

arXiv:2606.03509v1 Announce Type: new Abstract: Building memory is essential for long-horizon planning in zero-shot embodied navigation. Detector-centric scene graphs often compress observations into sparse nodes, discarding fine-grained visual evidence and accumulating noise, while 3D reconstruction-based methods remain computationally prohibitive. We present EvoMemNav, an efficient, self-evolving, fine-grained memory framework for zero-shot embodied navigation.

arXiv CS 7d ago

ToolFG: Towards Well-Grounded Fine-Grained Image Classification

arXiv:2606.02518v1 Announce Type: new Abstract: Fine-grained image classification (FGIC) has broad applications and has attracted significant research attention. In this paper, we explore a novel paradigm for solving FGIC by proposing \textbf{ToolFG}, the first tool-integrated MLLM-based framework tailored to FGIC. ToolFG enables MLLMs to autonomously and flexibly use external tools during the reasoning process, actively interact with images, and collect verifiable visual cues for...

arXiv CS 8d ago

Hierarchical Mask-Enhanced Dual Reconstruction Network for Few-Shot Fine-Grained Image Classification

arXiv:2506.20263v2 Announce Type: replace Abstract: Few-shot fine-grained image classification (FS-FGIC) is challenging as it requires distinguishing visually similar subclasses with extremely limited labeled examples. Existing methods suffer from critical limitations: metric-based methods lose spatial information and misalign local features, while reconstruction-based methods underuse hierarchical feature information and lack selective focus on discriminative key regions. We propose the...

arXiv CS 5d ago

Dual Feature Decoupling for Fine-Grained OOD Detection

arXiv:2606.05536v1 Announce Type: new Abstract: Out-of-distribution detection (OOD) is an indispensable technique when applying machine learning models to real-world scenarios. Most existing OOD detection methods have been developed under the idealized assumption of large inter-class distributional differences, while largely overlooking fine-grained tasks characterized by subtle variations, such as medical image classification and vehicle recognition. The high visual similarity among...

arXiv CS 5d ago

Concept-wise Attention for Fine-grained Concept Bottleneck Models

Announce Type: replace Abstract: Recently impressive performance has been achieved in Concept Bottleneck Models (CBM) by utilizing the image-text alignment learned by a large pre-trained vision-language model (i.e. CLIP). However, there exist two key limitations in concept modeling. Existing methods often suffer from pre-training biases, manifested as granularity misalignment or reliance on structural priors.

arXiv CS 7d ago