MoDA: Modulation Adapter for Fine-Grained Visual Grounding in Instructional MLLMs

arXiv CS Monday 08 June 2026, 04:00 UTC By Wayner Barrios, Andr\'es Villa, Juan Le\'on Alc\'azar, SouYoung Jin, Bernard Ghanem 1 min read

Key Points

arXiv:2506.01850v2 Announce Type: replace Abstract: Multimodal Large Language Models (MLLMs) have achieved remarkable success in instruction-following tasks by integrating pretrained visual encoders with large language models (LLMs). However, existing approaches often struggle with fine-grained visual grounding due to semantic entanglement in visual patch representations, where individual patches blend multiple distinct visual elements, making it difficult for models to focus on instruction-relevant details. To address this challenge, we propose MoDA (Modulation Adapter), a lightweight module that enhances visual grounding through instruction-guided channel-wise modulation. Unlike token-level methods such as Q-Former that perform additive feature selection, MoDA operates at the channel level through multiplicative modulation on already-aligned features, enabling fine-grained control over which embedding dimensions are relevant for each instruction. Following the standard LLaVA training protocol, MoDA applies cross-attention between language instructions and pre-aligned visual features, generating dynamic modulation masks without architectural modifications or additional supervision. We evaluate MoDA across 12 benchmarks spanning visual question answering, vision-centric reasoning, and hallucination detection, including recent 2024 benchmarks (MMVP, CV-Bench, MMStar, RealWorldQA), on three distinct MLLM architectures: LLaVA-1.5, LLaVA-MoRE (2025), and Qwen3-VL (2025). MoDA delivers consistent gains across all three families, with +12.0 points on MMVP for the LLaVA-1.5 family and +4.8 points on ScienceQA for the LLaVA-MoRE family, and +4.9 ScienceQA, +4.1 RealWorldQA, and +3.8 GQA on Qwen3-VL, confirming that the gains generalize beyond CLIP-based encoders with minimal overhead (<1% FLOPs). Code is available at https://github.com/waybarrios/MoDA.

Modulation Adapter (ORG) MMVP (ORG) CV-Bench (ORG) MMStar (ORG) Qwen3-VL (PERSON) CLIP (ORG)

Originally published by arXiv CS Read original →

MoDA: Modulation Adapter for Fine-Grained Visual Grounding in Instructional MLLMs

Related Stories

Popular UK seaside town hotel plunges into administration as holidaymakers updated

Scientists were excited about a blood test for many cancers — but it failed a big trial. Here's what to know.

After NSIL’s PPP bid, IN-SPACe opens LVM-3 to private sector with ToT push

NASA chief defends all-male Artemis 3 astronaut crew amid backlash: 'I don't think anyone should be reading into this'