IoU
Related Articles from SNS
Self-Improving Small Object Grounding in LVLMs
Announce Type: new Abstract: Can internal attention patterns in Large Vision Language Models (LVLMs) identify reliable small-object boxes without fine-tuning? In this work, we provide an affirmative answer. Attention structure in LVLMs encodes grounding quality: a lightweight IoU regressor trained solely on attention maps achieves strong IoU prediction (Pearson r > 0.67).
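As a reading aid for the "Pearson r" figure quoted above, here is a minimal sketch of how the correlation between predicted and true IoU values could be computed. The function name `pearson_r` and the sample numbers are assumptions made for illustration; they are not taken from the paper, and the paper's own evaluation code may differ.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Covariance and variances, all left unnormalized (the factor cancels).
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)

# Hypothetical values for illustration only, not results from the paper.
true_iou = [0.10, 0.35, 0.50, 0.72, 0.90]
predicted_iou = [0.20, 0.30, 0.55, 0.60, 0.85]
r = pearson_r(true_iou, predicted_iou)
print(f"Pearson r = {r:.3f}")
```

A value of r near 1 means the regressor ranks and scales boxes much like the true IoU does; r measures linear agreement only, so it says nothing about a constant offset in the predictions.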
Training-Free Object-Agnostic Jam Detection in Fulfillment Centers
arXiv:2606.00321v2 Announce Type: replace Abstract: In fulfillment centers, diverse objects move continuously from inbound to outbound operations and can become jammed due to excessive conveyor friction, incorrect orientation, or mechanical failures. Traditional jam detection approaches rely on object detection models to identify objects, followed by tracking algorithms (such as IoU overlap and Kalman filtering) to monitor motion over time. This pipeline requires thousands of manual...
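The "IoU overlap" tracking step mentioned above can be sketched roughly as follows. This illustrates the traditional detect-then-track association that the abstract describes as the baseline, not the paper's training-free method. The names `box_iou` and `associate`, the `(x1, y1, x2, y2)` box layout, and the `min_iou=0.3` cutoff are all assumptions chosen for the example.

```python
def box_iou(a, b):
    """IoU of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    # Clamp at zero so disjoint boxes get an empty intersection.
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def associate(prev_boxes, curr_boxes, min_iou=0.3):
    """Greedily pair boxes across two frames, highest IoU first.

    Returns a list of (prev_index, curr_index) pairs. Boxes left
    unpaired would start or end a track in a full tracker.
    """
    candidates = sorted(
        ((box_iou(p, c), i, j)
         for i, p in enumerate(prev_boxes)
         for j, c in enumerate(curr_boxes)),
        reverse=True,
    )
    used_prev, used_curr, pairs = set(), set(), []
    for score, i, j in candidates:
        if score < min_iou:
            break  # candidates are sorted, so the rest are weaker
        if i in used_prev or j in used_curr:
            continue
        used_prev.add(i)
        used_curr.add(j)
        pairs.append((i, j))
    return pairs
```

In a pipeline of this kind, an object whose matched box keeps a very high IoU with its own previous position over many frames is barely moving, which is one plausible signal for a jam. A production tracker would typically replace the greedy loop with optimal assignment and add a motion model such as the Kalman filter named in the abstract.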
Redefining Instance Matching: A Unified Framework for Part-Aware Matching in Panoptic Segmentation Evaluation
Announce Type: new Abstract: The Panoptic Quality (PQ) metric is the standard for jointly evaluating instance and semantic segmentation. However, its original definition relies on a One-to-One matching between predicted and ground truth segments, which is only straightforward when the IoU threshold exceeds 0.5. Below 0.5, multiple matching strategies emerge in a poorly explored problem space.
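To make the role of the 0.5 threshold concrete, here is a minimal single-class sketch of the standard PQ computation: a predicted and a ground-truth segment match when their IoU exceeds the threshold, and PQ is the sum of matched IoUs divided by |TP| + 0.5|FP| + 0.5|FN|. Segments are represented as Python sets of pixel indices, and the sketch assumes segments within each list do not overlap, as panoptic outputs require. Under that assumption a segment can exceed IoU 0.5 with at most one counterpart, so the matching is unique. The function names are illustrative, and this sketch does not implement the part-aware matching the paper proposes for thresholds below 0.5.

```python
def mask_iou(a, b):
    """IoU of two segments given as sets of pixel indices."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0

def panoptic_quality(pred_segments, gt_segments, threshold=0.5):
    """Single-class PQ with one-to-one matching at IoU > threshold."""
    if threshold < 0.5:
        # Below 0.5 a segment may clear the bar with several partners,
        # so first-found matching is no longer well defined.
        raise ValueError("this sketch only supports thresholds >= 0.5")
    matched_pred = set()
    iou_sum = 0.0
    tp = 0
    for gt in gt_segments:
        for k, pred in enumerate(pred_segments):
            if k in matched_pred:
                continue
            iou = mask_iou(pred, gt)
            if iou > threshold:
                matched_pred.add(k)
                iou_sum += iou
                tp += 1
                break  # at most one match exists above 0.5
    fp = len(pred_segments) - tp  # unmatched predictions
    fn = len(gt_segments) - tp    # unmatched ground-truth segments
    denom = tp + 0.5 * fp + 0.5 * fn
    return iou_sum / denom if denom else 0.0

# Toy example: one good match, one spurious prediction, one missed segment.
gt = [{0, 1, 2, 3}, {10, 11}]
pred = [{0, 1, 2}, {20, 21}]
pq = panoptic_quality(pred, gt)
```

In the toy example the single match has IoU 3/4, and there is one false positive and one false negative, so PQ = 0.75 / (1 + 0.5 + 0.5) = 0.375.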
Plan2Map: A Multimodal Benchmark for Document-Grounded Geospatial Boundary Reconstruction from Planning Records
Announce Type: new Abstract: Planning records define restrictions over geographic areas, but their source documents often provide only indirect spatial evidence rather than machine-readable boundaries. We introduce Plan2Map, a 208-case multimodal benchmark for document-grounded geospatial boundary reconstruction from UK planning records. Given only a source planning document, systems must reconstruct a valid geospatial boundary from notice text, schedules, map plates, map labels, and...
MS-DKC: A Dataset Knowledge Card Framework for Designing and Adapting Medical Image Segmentation Models
arXiv:2606.06103v1 Announce Type: new Abstract: Medical image segmentation is often framed as a search for stronger architectures, but this can obscure a more fundamental question: what does the dataset require from the model? In medical imaging, this requirement is shaped by foreground occupancy, morphology, boundary ambiguity, topology sensitivity, annotation quality, acquisition variation, and operating point. This paper introduces the Medical Segmentation Dataset Knowledge Card (MS-DKC),...
Cross-Domain Dead Tree Detection via Knowledge Distillation in Aerial Imagery
arXiv:2606.02303v1 Announce Type: new Abstract: Detecting dead trees in aerial imagery is vital for assessing forest health, especially as tree mortality increases globally due to climate change, but domain variability and scarce labeled data often limit model generalization. This study advances the TreeMort-1T-UNet (Tree Mortality 1-Task U-Net) model, initially trained on Finnish aerial imagery (source domain), by applying knowledge distillation (KD) to adapt it to various target domains,...
Hand Trajectory Fusion for Egocentric Natural Language Query Grounding
arXiv:2606.02962v1 Announce Type: new Abstract: Egocentric Natural Language Query (NLQ) grounding asks a model to localize, in a long first-person video, the temporal interval that answers a free-form text query. Existing methods fuse video appearance with the query but ignore hand motion, even though roughly 41% of Ego4D NLQ queries are answered at a moment of hand-object manipulation or its immediate outcome. We propose a hand-trajectory encoder for converting a sequence of...
ObjEmbed: Towards Universal Multimodal Object Embeddings
arXiv:2602.01753v3 Announce Type: replace Abstract: Aligning objects with corresponding textual descriptions is a fundamental challenge and a realistic requirement in vision-language understanding. While recent multimodal embedding models excel at global image-text alignment, they often struggle with fine-grained alignment between image regions and specific phrases. In this work, we present ObjEmbed, a novel MLLM embedding model that decomposes the input image into multiple regional...
Attention-Guided Autoencoder Fusion for Insulator Defect Detection Using UAV Transmission-Line Imaging
arXiv:2606.06536v1 Announce Type: new Abstract: Automated defect detection in high-voltage transmission-line insulators remains challenging due to severe class imbalance, large scale variation, and the small spatial extent of defect instances in Unmanned Aerial Vehicle (UAV) imagery. To address these challenges, this paper proposes AE-YOLO, an Attention-Guided AutoEncoder-Enhanced YOLO framework for robust insulator defect detection. The architecture integrates lightweight bottleneck...
SpaCeFormer: Fast Proposal-Free Open-Vocabulary 3D Instance Segmentation
arXiv:2604.20395v2 Announce Type: replace Abstract: Open-vocabulary 3D instance segmentation is a core capability for robotics and AR/VR, but prior methods trade one bottleneck for another: multi-stage 2D+3D pipelines aggregate foundation-model outputs at hundreds of seconds per scene, while pseudo-labeled end-to-end approaches rely on fragmented masks and external region proposals. We present SpaCeFormer, a proposal-free space-curve transformer that runs in 0.12 to 0.30 seconds per scene...