Large Multimodal Models

Related Articles from SNS

Advancing Creative Physical Intelligence in Large Multimodal Models

Announce Type: replace Abstract: Large multimodal models (LMMs) have rapidly advanced in perception and reasoning; however, it remains unclear whether these capabilities generalize to discovering visually grounded solutions in open-ended environments, beyond pattern recognition. In such settings, intelligence requires more than answering well-posed questions: it involves identifying how elements in a scene can be repurposed in non-obvious yet physically feasible ways. This form of creative...

arXiv CS 9d ago

NutriMLLM: Multimodal Large Language Models for Dietary Micronutrient Analysis

arXiv:2606.08948v1 Announce Type: new Abstract: Comprehensive estimation of dietary micronutrients from food images could improve clinical nutrition care, but training such models requires large multimodal datasets linking diverse foods to complete nutrient profiles. We first show that existing multimodal large language models (MLLMs), including leading proprietary models, are unreliable for this task. Across five model families and four independent evaluation benchmarks (ASA24, SNAPMe,...

arXiv CS 1d ago

Learning Geometric Representations from Videos for Spatial Intelligent Multimodal Large Language Models

Announce Type: new Abstract: Multimodal Large Language Models (MLLMs) excel at 2D semantic understanding but lack intrinsic 3D awareness, resulting in representations that fail to maintain geometric and spatial consistency across video frames. Given the scarcity of large-scale 3D data, we present GeoVR, a novel framework that learns geometric representations using purely 2D video sequences. This approach effectively restructures the semantic latent space within MLLMs to unlock spatial...

arXiv CS 5d ago

Enhancing Video Representations with Spatiotemporal-Semantic Residual to Mitigate Hallucinations in Video Large Multimodal Models

arXiv:2601.22574v2 Announce Type: replace Abstract: Although Video Large Multimodal Models have achieved strong performance in video understanding, they still suffer from hallucination. Existing inference-time intervention methods usually modify videos under the contrastive decoding framework, but their heuristic designs bring limited improvements and increase inference latency. To address these issues, we propose ViSSRes, an inference-time intervention method that enhances video...

arXiv CS 2d ago

ERGeoBench: A Comprehensive Benchmark for Embodied Reasoning and Geo-localization in Multimodal Large Language Models

arXiv:2605.31251v1 Announce Type: new Abstract: Multimodal large language models (MLLMs) have shown strong potential as embodied agents, yet embodied geo-localization remains underexplored due to the lack of fine-grained evaluation. We introduce ERGeoBench, a diagnostic benchmark for vision-driven embodied geo-localization. ERGeoBench evaluates models under three progressive settings -- single-view, panorama-view, and embodied-view -- where agents may actively acquire observations through...

arXiv CS 9d ago

Mechanistic Diagnostics of Spatial Lexical Bias in Multimodal Large Language Model Spatial Reasoning

Announce Type: new Abstract: Multimodal large language models (MLLMs) remain unreliable on spatial multiple-choice questions, and their failures are often attributed to poorly attended visual information. In this work, we identify a complementary failure mode, spatial lexical bias: adding a spatial relation word to the answer options can attract the model's decision and make the newly added option likely to be selected.

arXiv CS 8d ago

LookWise: Knowing When and Where to Look for Fine-Grained Visual Reasoning in Multimodal Large Language Models

arXiv:2603.00171v3 Announce Type: replace Abstract: Multimodal Large Language Models (MLLMs) are shifting towards "Thinking with Images" by actively exploring image details. While effective, large-scale training is computationally expensive, which has spurred growing interest in lightweight, training-free solutions. However, existing training-free methods suffer from two flaws: perceptual redundancy from indiscriminate cropping, which increases computational cost and introduces noise; and a...

arXiv CS 8d ago

InPhyRe Discovers: Large Multimodal Models Struggle in Inductive Physical Reasoning

arXiv:2509.12263v3 Announce Type: replace Abstract: Large multimodal models (LMMs) encode physical laws observed during training, such as momentum conservation, as parametric knowledge. It allows LMMs to answer physical reasoning queries, such as the outcome of a potential collision event from visual input. However, since parametric knowledge includes only the physical laws seen during training, it is insufficient for reasoning in inference scenarios that follow physical laws unseen during...

arXiv CS 8d ago

Evaluating Stochastic Collapse and Implicit Bias in Multimodal Large Language Models

arXiv:2606.05874v1 Announce Type: new Abstract: Current evaluations for Multimodal Large Language Models (MLLMs) overwhelmingly focus on utility-driven objectives, leaving model behavior under logic-neutral scenarios largely underexplored. Stochasticity is essential in scenarios where multiple actions are equally valid, such as recommending travel itineraries or daily schedules where multiple options have similar utility. In such settings, deterministic policies may lead to repetitive...

arXiv CS 5d ago

When Vision Misleads, Let Location Speak: A Worldwide Image Geo-Localization Method via Location Attention Mechanism and Large Multimodal Models

arXiv:2606.08918v1 Announce Type: new Abstract: Worldwide image geo-localization aims to determine the capture location of an image on a global scale. Existing methods often mislocalize images by matching them to visually similar scenes from different geographic regions, which limits reliability in practical applications. To address this issue, we propose TransGeoCLIP, a novel retrieval-based framework that integrates a location attention mechanism and large multimodal models (LMMs).

arXiv CS 1d ago