Multimodal Generative Retrieval

Related Articles from SNS

Closing the Indexing-Decoding Gap in Multimodal Generative Retrieval via Prefix Retention Optimization

arXiv:2606.09241v1 Announce Type: new Abstract: Multimodal generative retrieval formulates multimodal retrieval as discrete identifier generation, eliminating the need for explicit similarity search over external embeddings. Existing approaches construct identifiers via residual quantization and decode them with trie-constrained beam search. This combination introduces an indexing-decoding gap: identifier learning objectives, including reconstruction and contrastive losses, do not explicitly...

arXiv CS 1d ago

Seeing Through the MiRAGE: Evaluating Multimodal Retrieval Augmented Generation

arXiv:2510.24870v2 Announce Type: replace Abstract: We introduce MiRAGE, an evaluation framework for retrieval-augmented generation (RAG) from multimodal sources. As audiovisual media becomes a prevalent source of information online, it is essential for RAG systems to integrate information from these sources into generation. However, existing evaluations for RAG are text-centric, limiting their applicability to multimodal settings.

arXiv CS 8d ago

MM-BizRAG: Rethinking Multimodal Retrieval-Augmented Generation for General Purpose Enterprise Q&A

arXiv:2606.04231v1 Announce Type: new Abstract: Recent advances in multimodal retrieval-augmented generation (MM-RAG) have shifted toward minimal parsing, relying on page-level images for producing retriever embeddings and for answer generation. While efficient, this trend often neglects explicit handling of the rich, structured information in complex enterprise documents, instead depending on pre-trained embeddings or vision-language models to implicitly capture such structure. In this...

arXiv CS 6d ago

Overview of the EReL@MIR 2025 Multimodal Document Retrieval Challenge (Track 1)

Announce Type: new Abstract: Retrieval over visually-rich documents, pages that interleave text with figures, tables, and charts, is essential for multimodal retrieval-augmented generation, yet most retrievers still discard the visual channel. The Multimodal Document Retrieval Challenge, Track 1 of the MIR Challenge at the first EReL@MIR workshop, co-located with The Web Conference 2025, asks participants to build a single retrieval system that handles two complementary...

arXiv CS 6d ago

Decoupling Semantics and Logic: A Training-Free Coarse-to-Fine Pipeline for Video Retrieval-Augmented Generation

Announce Type: new Abstract: This paper presents our system description for the 2nd Workshop on Multimodal Augmented Generation via MultimodAl Retrieval (MAGMaR). Addressing the critical challenges of cross-lingual long-video comprehension, strict persona adherence, and zero-hallucination temporal grounding, we propose a fully training-free, two-stage cascaded Video RAG pipeline. Our architecture strategically decouples semantic retrieval from cognitive logical reasoning through a...

arXiv CS 1d ago

Multimodal Generative Engine Optimization: Rank Manipulation for Vision-Language Model Rankers

Announce Type: replace Abstract: Vision-Language Models (VLMs) integrate visual and textual knowledge into unified representations that increasingly underpin modern retrieval and recommendation systems. However, it remains unclear how reliably these models utilize their cross-modal knowledge when ranking multimodal items, and whether their knowledge grounding can be subverted. In this paper, we expose a fundamental vulnerability in how VLMs apply multimodal knowledge for product ranking:...

arXiv CS 1d ago

Securing Retrieval-Augmented Generation: A Taxonomy of Attacks, Defenses, and Future Directions

arXiv:2604.08304v3 Announce Type: replace Abstract: Retrieval-augmented generation (RAG) extends large language models (LLMs) with external knowledge, but this access path also introduces security risks that existing work often conflates with inherent LLM flaws. We frame secure RAG as securing external knowledge access and organize the literature with SLOT, a taxonomy along four axes: the attack Surface (S) where an adversary acts, the defense Layer (L) that controls the same point, the...

arXiv CS 1d ago

MCERF: Advancing Multimodal LLM Evaluation of Engineering Documentation with Enhanced Retrieval

Announce Type: replace Abstract: Engineering rulebooks and technical standards contain multimodal information like dense text, tables, and illustrations that are challenging for retrieval augmented generation (RAG) systems. Building upon the DesignQA framework [1], which relied on full-text ingestion and text-based retrieval, this work establishes a Multimodal ColPali Enhanced Retrieval and Reasoning Framework (MCERF), a system that couples a multimodal retriever with large language model...

arXiv CS 2d ago

Inference-Free Multimodal Learned Sparse Retrieval for Production-Scale Visual Document Search

arXiv:2605.30917v1 Announce Type: new Abstract: As large-scale visual-document corpora such as arXiv papers and enterprise PDFs continue to grow, visual-document retrieval has gained increasing attention; yet it still lacks a deployable system that lexically indexes visual documents to serve queries without neural encoding at scale. Existing methods either achieve strong retrieval quality with VLM-based dense or multi-vector models but require neural query encoding at serving time, or avoid...

arXiv CS 9d ago

Constrained Dominant Sets for Multimodal Document Question Answering

arXiv:2606.07252v1 Announce Type: new Abstract: Long multimodal document question answering is limited by which evidence reaches the reader, rather than by the quantity retrieved. In lengthy documents, findings often recur across figures, captions, and introductory sentences, causing similarity based retrievers in modern multimodal retrieval-augmented generation (RAG) systems to allocate resources to near-duplicates while overlooking complementary evidence. This work introduces a retriever...

arXiv CS 2d ago