Home › Knowledge Base › Cross-Modality

Cross-Modality

No mentions found

This entity hasn't been tracked yet, or Iris is still building its knowledge base.

Related Articles from SNS

CR-JEPA: Cross-Modal Joint-Embedding Predictive Learning for Remote Sensing Image Retrieval

arXiv:2606.00706v2 Announce Type: replace Abstract: Cross-modal remote sensing image retrieval aims to retrieve semantically related scenes across heterogeneous sensing modalities. This remains challenging because paired observations may differ substantially in imaging physics, spatial resolution, spectral configuration, and visual appearance. Moreover, a single retrieval projection trained with one objective may be insufficient to jointly support cross-modal semantic alignment and...

arXiv CS 1d ago

PRISM: Topology-Aware Cross-Modal Imputation for Modality-Deficient Federated Graph Learning

Announce Type: new Abstract: Multimodal federated graph learning (MM-FGL) aims to collaboratively learn from decentralized graphs with text and images. However, real-world clients may not share a common modality basis: a visual-search client may contain image--interaction graphs but no seller descriptions, while a catalog client may provide text but no product images. We refer to this practical setting as client-level modality deficiency.

arXiv CS 1d ago

VFEM: Visual Feature Empowered Multivariate Time Series Forecasting with Cross-Modal Fusion

Announce Type: replace Abstract: Large time series foundation models often adopt channel-independent architectures to handle varying data dimensions, but this design ignores crucial cross-channel dependencies. Meanwhile, existing cross-modal methods predominantly rely on textual modalities, leaving the spatial pattern recognition capabilities of vision models underexplored for time series analysis. To address these limitations, we propose VFEM, a cross-modal forecasting model that leverages...

arXiv CS 1d ago

Variational Adapter for Cross-modal Similarity Representation

arXiv:2605.30968v1 Announce Type: new Abstract: The core of vision-language models lies in measuring cross-modal similarity within a unified representation space. However, most image-text matching or multi-class image classification datasets lack fine-grained cross-modal matching annotations, forcing the continuous similarity space into binary classification boundaries. This compression induces false negative samples and significantly impairs the generalization performance of cross-modal tasks.

arXiv CS 9d ago

A Sonar-Visual Dataset for Cross-Modal Underwater Robot Perception

arXiv:2606.01398v1 Announce Type: new Abstract: Underwater robots typically use both cameras and sonar for perception to leverage the rich semantic details of vision and the robust range measurements of acoustics. However, learning to map between these modalities via cross-modal prediction remains underexplored due to limited sonar-visual paired datasets. We present SOVIS, a sonar-visual dataset for cross-modal underwater perception.

arXiv CS 8d ago

Benign Inputs, Harmful Outputs: Cross-Modal Jailbreaking via Distributed Semantic Recomposition

arXiv:2606.01837v1 Announce Type: new Abstract: Multimodal Large Language Models (MLLMs) have recently demonstrated remarkable capabilities in content synthesis and autonomous reasoning. Previous safety guardrails are primarily designed for unimodal textual input interception, leaving them vulnerable to cross-modal jailbreak attacks. However, regardless unimodal textual attack or cross-modal jailbreak, typically inclusive part of explicit harmful or sensitive content at the input level,...

arXiv CS 8d ago

Text-guided Feature Disentanglement for Cross-modal Gait Recognition

Announce Type: new Abstract: Gait recognition is a biometric technique that identifies individuals based on their walking patterns, offering advantages in long-range, non-intrusive scenarios. However, real-world scenarios often involve heterogeneous sensing modalities such as LiDAR and RGB cameras, making LiDAR-Camera Cross-modal Gait recognition (LCCGR) a critical yet challenging task due to the substantial modality gap between 2D videos and 3D point cloud sequences. To address this...

arXiv CS 9d ago

HQ-JEPA: Hybrid Quantum Joint-Embedding Predictive Architecture for Cross-Modal Remote Sensing Representation Learning

arXiv:2605.31068v1 Announce Type: new Abstract: We introduce HQ-JEPA, a hybrid quantum-classical joint-embedding predictive architecture for cross-modal remote sensing representation learning. The proposed framework extends JEPA-style masked latent prediction to paired Sentinel-1 and Sentinel-2 imagery by predicting masked target representations from visible context regions while aligning heterogeneous modality features in a shared embedding space. To improve representation quality, HQ-JEPA...

arXiv CS 9d ago

RankByGene: Gene-Guided Histopathology Representation Learning Through Cross-Modal Ranking Consistency

arXiv:2411.15076v3 Announce Type: replace-cross Abstract: Spatial transcriptomics (ST) provides essential spatial context by mapping gene expression within tissue, enabling detailed study of cellular heterogeneity and tissue organization. However, aligning ST data with histology images poses challenges due to inherent spatial distortions and modality-specific variations. Existing methods largely rely on direct alignment, which often fails to capture complex cross-modal relationships.

arXiv CS 8d ago

Query-based Cross-Modal Projector Bolstering Mamba Multimodal LLM

Announce Type: new Abstract: The Transformer's quadratic complexity with input length imposes an unsustainable computational load on large language models (LLMs). In contrast, the Selective Scan Structured State-Space Model, or Mamba, addresses this computational challenge effectively. This paper explores a query-based cross-modal projector designed to bolster Mamba's efficiency for vision-language modeling by compressing visual tokens based on input through the cross-attention mechanism.

arXiv CS 6d ago