Home › Knowledge Base › MLLM

MLLM

No mentions found

This entity hasn't been tracked yet, or Iris is still building its knowledge base.

Related Articles from SNS

Video-Rate Streaming Stylization on a Vision-Aware MLLM-Conditioned Edit Diffusion: Asymmetric Batched Inference on a Distilled UNet + MLLM Text Encoder

Announce Type: new Abstract: Aggressive distillation of the diffusion U-Net inverts the per-frame bottleneck of real-time text-to-image pipelines: once the denoiser is a 4-step or 1-step distilled student, the text encoder becomes the critical path. This inversion is most acute in vision-aware edit diffusion, where the encoder is a multimodal large language model (MLLM). We study the case of a 0.39B distilled edit U-Net paired with a 2.13B MLLM text encoder (Qwen3-VL) and present a streaming...

arXiv CS 5d ago

ES-Merging: Biological MLLM Merging via Embedding Space Signals

arXiv:2603.14405v2 Announce Type: replace Abstract: Biological multimodal large language models (MLLMs) have emerged as powerful foundation models for scientific discovery. However, existing models are specialized to a single modality, limiting their ability to solve inherently cross-modal scientific problems. While model merging is an efficient method to combine the different modalities into a unified MLLM, existing methods rely on input-agnostic parameter space heuristics that fail to...

arXiv CS 8d ago

SMART: Shot-Aware Multimodal Video Moment Retrieval with Audio-Enhanced MLLM

arXiv:2511.14143v2 Announce Type: replace Abstract: Video Moment Retrieval is a task in video understanding that aims to localize a specific temporal segment in an untrimmed video based on a natural language query. Despite recent progress in moment retrieval from videos using both traditional techniques and Multimodal Large Language Models (MLLM), most existing methods still rely on coarse temporal understanding and a single visual modality, limiting performance on complex videos. To address...

arXiv CS 1d ago

PAR3D: A Unified 3D-MLLM with Part-Aware Representation for Scene Understanding

new Abstract: Recent advances in 3D multimodal large language models (3D-MLLMs) have enabled unified solutions for 3D scene understanding tasks, including visual question answering, captioning, and referring segmentation. However, existing 3D-MLLMs remain largely object-centric, limiting their ability to model fine-grained part structures that are essential for embodied interaction with 3D environments. In this work, we present PAR3D, a unified part-aware 3D-MLLM framework that enables...

arXiv CS 5d ago

WebRISE: Requirement-Induced State Evaluation for MLLM-Generated Web Artifacts

Announce Type: new Abstract: Existing benchmarks for MLLM-generated web artifacts assess interaction through local evidence and miss the requirement-induced states and transitions that determine whether a page works. We introduce WebRISE, which compiles task requirements into Interaction Contract Graphs (ICGs) of observable states, user-intent transitions, and DOM/visual assertions for implementation-agnostic browser execution. WebRISE spans 442 tasks across five input modalities (Text,...

arXiv CS 7d ago

Design-MLLM: A Reinforcement Alignment Framework for Verifiable and Aesthetic Interior Design

Announce Type: replace Abstract: Interior design is a requirements-to-visual-plan generation process that must simultaneously satisfy verifiable spatial feasibility and comparative aesthetic preferences. While recent multimodal large language models (MLLMs) offer a unified foundation for interpreting user intent and producing design rationales, our empirical analysis reveals a persistent contradiction in real-world deployment: MLLMs often produce layouts that are unbuildable and...

arXiv CS 8d ago

MineExplorer: Evaluating Open-World Exploration of MLLM Agents in Minecraft

arXiv:2605.30931v1 Announce Type: new Abstract: Multimodal large language models (MLLMs) have shown strong capabilities in perception, reasoning, and action generation. However, their ability to sustain exploration in dynamic open worlds remains unclear. Existing embodied and game-based benchmarks often compress interaction into short-horizon tasks or entangle success with domain-specific game mechanics.

arXiv CS 9d ago

Dynamic Content Moderation in Livestreams: Combining Supervised Classification with MLLM-Boosted Similarity Matching

arXiv:2512.03553v3 Announce Type: replace Abstract: Content moderation remains a critical yet challenging task for large-scale user-generated video platforms, especially in livestreaming environments where moderation must be timely, multimodal, and robust to evolving forms of unwanted content. We present a hybrid moderation framework deployed at production scale that combines supervised classification for known violations with reference-based similarity matching for novel or subtle cases....

arXiv CS 6d ago

When Recovery Matters: The Blind Spot of Surrogate Privacy in MLLM Editing

arXiv:2606.07171v1 Announce Type: new Abstract: Multimodal Large Language Models (MLLMs) enable flexible instruction-driven image editing, but privacy risks arise when user images expose diverse and user-specific private content. Canonical privacy protection strategies typically substitute sensitive regions with surrogate content before cloud editing. Yet, the resulting output is often an edited surrogate rather than the desired edited source image, neglecting the local recovery in both...

arXiv CS 2d ago

Stabilizing On-Policy Distillation for MLLM Reasoning with Global Normalization

arXiv:2606.09091v1 Announce Type: new Abstract: On-policy distillation (OPD) has recently emerged as an important post-training paradigm. By using a stronger teacher model to provide dense, fine-grained supervision for sampled trajectories, OPD offers a clear advantage over reinforcement learning with verifiable rewards (RLVR), which typically depends on sparse binary or outcome-based environmental feedback. However, naive token-level distillation can suffer from gradient instability, due to...

arXiv CS 1d ago