Perception Module

Related Articles from SNS

Learned Non-Maximum Suppression for 3D Object Detection

arXiv:2606.03568v1. Abstract: Post-processing is a critical stage in LiDAR-based 3D object detection, where dense and overlapping proposals must be filtered for compact and reliable perception. This work introduces two learned filtering modules that replace heuristic non-maximum suppression (NMS) by leveraging relations among detections. D2D-Rescore employs transformer-based detection-to-detection (D2D) attention, while GossipNet3D adapts the 2D GossipNet concept to 3D...

arXiv CS 7d ago

Resonant Minds: Closed-Loop Social Avatars with Theory of Mind

Abstract: Creating lifelike digital humans with genuine social intelligence requires unifying cognitive reasoning and multimodal generation within a coherent framework. Current approaches treat these as separate tasks: Large Language Models excel at dialogue but lack embodied expression, while diffusion-based talking head models achieve visual fidelity but ignore social cognition.

arXiv CS 5d ago

3DThinkVLA: Endowing Vision-Language-Action Models with Latent 3D Priors via 3D-Thinking-Guided Co-training

Abstract: We propose a 3D-thinking-guided co-training framework that enables vision-language-action (VLA) models to perform 3D spatial reasoning implicitly during action prediction. Our core insight is that 3D geometry perception and 3D spatial reasoning are distinct capabilities that can be disentangled and injected at different feature hierarchies. During training, three tightly coupled components work in concert primarily within the latent space: (1) To gain geometric...

arXiv CS 6d ago

CoSTL: Comprehensive Spatial-Temporal Representation Learning for Moment Retrieval and Highlight Detection

arXiv:2606.01149v1. Abstract: Video Moment Retrieval (MR) and Highlight Detection (HD) are crucial tasks in video analysis that aim to localize specific moments and estimate clip-wise relevance based on a given text query. Recent approaches treat them as similar video grounding tasks and use the same architecture to solve them. These tasks require both fine-grained comprehension at the image level and high-level temporal understanding across the entire video.

arXiv CS 8d ago

FMRFusion: Frequency-Aware Multi-View Representation Learning for Heterogeneous Image Fusion

Abstract: Infrared and visible image fusion aims to generate a composite image that retains significant target information and preserves detailed textures, integrating two heterogeneous modalities. Previous image fusion methods typically adopt a single-module stacking approach to extract features from the two modalities. However, these approaches may result in incomplete learning of their distinct characteristics, thereby limiting the fusion effectiveness and constraining...

arXiv CS 1d ago

Edge-Aware and Content-Adaptive Infrared Gas Leak Detection for Industrial Safety Monitoring

Abstract: Infrared gas leak detection is important for industrial safety and environmental monitoring, but automatic detection remains challenging because gas plumes are often faint, small, semi-transparent, and weakly bounded. This paper proposes an Edge-Aware and Content-Adaptive Feature Fusion Detector (ECAF-Det) for weak-plume detection in cluttered thermal scenes. ECAF-Det integrates three task-oriented designs: a plume-oriented local-global feature enhancement...

arXiv CS 7d ago

DifferSeg: Towards Diverse Multimodal Binary Segmentation via Differential Perception and Frequency Guidance

Abstract: In many binary segmentation tasks, most multimodal methods rely on fixed feature concatenation for cross-modal interaction and on straightforward decoder designs dominated by low-frequency semantics. However, they ignore two key challenges: the lack of an adaptive mechanism to handle modality discrepancies and complementarity, and the absence of an efficient decoding strategy that balances high- and low-frequency representations.

arXiv CS 1d ago

Reason Twice: Segmentation via Candidate Discovery and Comparative Reasoning

arXiv:2606.09303v1. Abstract: The rapid development of pretrained foundation models has enabled more general image segmentation. Multimodal large language models (MLLMs) have been widely explored for image segmentation with complex queries that require high-level reasoning. Despite promising progress, existing methods are often constrained by limited training data and the gap between MLLMs and mask generation modules.

arXiv CS 1d ago

MUSE: A Unified Agentic Harness for MLLMs

Abstract: Despite rapid progress, multimodal large language models (MLLMs) still fail on tasks that humans solve effortlessly, such as navigating a grid maze from a screenshot or selecting the correct puzzle piece. Rather than retraining the model, we ask a complementary question: how much capability can be elicited from a frozen MLLM purely by improving the execution scaffold around it? We introduce MUSE, a multimodal unified structured execution harness that wraps any...

arXiv CS 7d ago

HDRAgent: An Agentic Framework for Multi-Exposure HDR Imaging

Abstract: Most existing multi-exposure HDR methods follow a fixed feed-forward reconstruction paradigm, making them prone to ghosting artifacts in complex dynamic scenes. To address this issue, we propose HDRAgent, the first agent-driven framework for HDR imaging, which adaptively selects reconstruction strategies according to the current scene conditions. Specifically, to provide scene-specific prior knowledge, we introduce a fine-grained contextual knowledge matching...

arXiv CS 1d ago