Home › Knowledge Base › Omni

Omni

No mentions found

This entity hasn't been tracked yet, or Iris is still building its knowledge base.

Related Articles from SNS

Conan-embedding-v3: Fusing Modality-Specific Models for Omni-Modal Embedding

arXiv:2606.09331v1 Announce Type: new Abstract: Omni-modal retrieval promises a single embedding space for text, image, video, document, and audio inputs, but building such a unified retriever is difficult since these modalities differ in data distribution, architecture, and optimization dynamics. In this work, we present Conan-embedding-v3, a decouple--fuse--recover framework for omni-modal retrieval. Conan-embedding-v3 first trains modality specialists independently and fuses their task...

arXiv CS 1d ago

OmniCap-IF: Benchmarking and Improving Instruction Following Abilities for Omni-Video Captioning

arXiv:2606.08572v1 Announce Type: new Abstract: While Omni-modal Large Language Models (OLLMs) have demonstrated impressive capabilities in jointly processing audio and visual streams, their ability to strictly adhere to complex, multi-faceted user instructions remains largely unexplored. Existing benchmarks primarily focus on holistic video understanding or text-only instruction following, failing to capture the intricate interplay between modalities and user constraints. To bridge this...

arXiv CS 1d ago

OmniHalluc-L: Counterfactual Benchmarking and Modality-Perturbation Reliability Calibration for Long-Form Omni Hallucination

Announce Type: new Abstract: Long-video Omni assistants often fail not by inventing content, but by misbinding real evidence: they hear the right utterance and see the right event, yet attach it to the wrong speaker, moment, or modality. These \emph{almost-true} errors evade standard video QA because local evidence remains valid, so item-level scoring can reward both a supported claim and its near-counterfactual. We introduce a counterfactual event-binding protocol that constructs paired...

arXiv CS 7d ago

MCBench: A Multicontext Safety Assessment Benchmark for Omni Large Language Models

arXiv:2606.05177v1 Announce Type: new Abstract: Existing multimodal safety benchmarks focus solely on visual inputs and cannot assess Omni Large Language Models (LLMs) that process vision, audio, and text. We introduce MCBench, a benchmark with 1196 scenarios spanning four safety categories that require integrating multiple modalities for accurate safety assessment. Each unsafe scenario is paired with a minimally different safe counterpart to assess model sensitivity.

arXiv CS 5d ago

Foley-Omni: A Unified Multimodal Generation Model from Task-Level Audio Synthesis to Complete Video Soundtrack Generation

Announce Type: new Abstract: Recent unified audio generation models can support diverse tasks across speech, sound effects, and music, but most of them still focus on isolated task-level synthesis. However, real video production often requires multiple components of a complete audio track to be generated jointly and consistently for the same video. We present Foley-Omni, a unified multimodal audio generation model that extends isolated task-level synthesis to complete video soundtrack...

arXiv CS 7d ago

MorphoQuant: Modality-Aware Quantization for Omni-modal Large Language Models

arXiv:2606.04349v2 Announce Type: replace Abstract: Conventional Post-Training Quantization (PTQ) methods struggle with 4-bit Omni-modal Large Language Models (OLLMs) due to the extreme distribution heterogeneity and disparate outlier patterns across modalities. To address this, we propose MorphoQuant, a modality-aware PTQ framework engineered to preserve cross-modal morphology and mitigate outlier loss. Specifically, we introduce Distribution-Aware Bias Compensation (DABC), which...

arXiv CS 2d ago

MorphoQuant: Modality-Aware Quantization for Omni-modal Large Language Models

Announce Type: new Abstract: Conventional Post-Training Quantization (PTQ) methods struggle with 4-bit Omni-modal Large Language Models (OLLMs) due to the extreme distribution heterogeneity and disparate outlier patterns across modalities. To address this, we propose MorphoQuant, a modality-aware PTQ framework engineered to preserve cross-modal morphology and mitigate outlier loss. Specifically, we introduce Distribution-Aware Bias Compensation (DABC), which selectively absorbs long-tailed...

arXiv CS 6d ago

jina-embeddings-v5-omni: Geometry-preserving Embeddings via Locked Aligned Towers

Announce Type: replace Abstract: In this work, we introduce GELATO (Geometry-preserving Embeddings via Locked Aligned TOwers), a novel approach to multimodal embedding models. We build on the VLM-style architecture, in which non-text encoders are adapted to produce input for a language model, which in turn generates embeddings for all varieties of input. We present the result: the jina-embeddings-v5-omni suite, a pair of models that encode text, image, audio, and video input into a single...

arXiv CS 1d ago

Omni-Embed-Audio: Leveraging Multimodal LLMs for Robust Audio-Text Retrieval

arXiv:2604.18360v2 Announce Type: replace Abstract: Audio-text retrieval systems based on Contrastive Language-Audio Pretraining (CLAP) achieve strong performance on traditional benchmarks; however, these benchmarks rely on caption-style queries that differ substantially from real-world search behavior, limiting their assessment of practical retrieval robustness. We present Omni-Embed-Audio (OEA), a retrieval-oriented encoder leveraging multimodal LLMs with native audio understanding. To...

arXiv CS 8d ago

Omni-Supervised Motion Editing: Balancing Change and Invariance through Positive-Negative Learning

Announce Type: new Abstract: Text-based human motion editing aims to modify existing motion sequences according to natural language instructions while maintaining the consistency of the original motion. Existing diffusion-based approaches often rely on heuristic similarity cues or coarse global conditioning, leading to motion distortion and suboptimal semantic alignment. The key challenge lies in balancing change (i.e. precisely editing target regions) and invariance (i.e. preserving...

arXiv CS 9d ago