Benchmarking Visual State Tracking

Related Articles from SNS

Benchmarking Visual State Tracking in Multimodal Video Understanding

arXiv:2606.03920v1 Announce Type: new Abstract: Understanding a video requires more than recognizing isolated moments, as humans continuously track entities, states, and events over time. This capacity for visual state tracking is fundamental to video understanding, yet remains underexplored in current evaluations of Multimodal Large Language Models (MLLMs). We introduce the Visual STAte Tracking benchmark (VSTAT), a video-based benchmark designed to diagnose visual state tracking in MLLMs.

arXiv CS 7d ago

Fully Spiking Neural Networks with Target Awareness for Energy-Efficient UAV Tracking

Announce Type: replace Abstract: Spiking Neural Networks (SNNs), characterized by their event-driven computation and low power consumption, have shown great potential for energy-efficient visual tracking on unmanned aerial vehicles (UAVs). However, existing SNN-based trackers often rely on costly event cameras, which limits their deployment on standard RGB-camera UAV platforms.

arXiv CS 1d ago

Dual Quaternion-Based Unscented Kalman Filter with Visual Inertial Odometry for Navigation in GPS-Denied Environments

Announce Type: new Abstract: Reliable navigation in GPS-denied environments remains a fundamental challenge in robotics, aerospace, and autonomous vehicle applications. This paper presents a Dual Quaternion-Based Unscented Kalman Filter (DQUKF) equipped with a Visual Inertial Odometry (VIO) algorithm for accurate state estimation enabling navigation in GPS-denied locations. The proposed framework formulates the DQUKF in an error-state manner, where the nominal pose is represented by a unit...

arXiv CS 1d ago

Rethinking Search as Code Generation

Evolving search from monolithic services to programmable primitives for the era of agent harnesses. Search is a core primitive for AI systems. Frontier models grow more capable by the month, but they still need access to fresh, accurate, and well-curated knowledge from the wider world.

Hacker News 8d ago

IntentNav: Learning Spatial-Visual Object Navigation from Human Demonstrations

Announce Type: new Abstract: Object navigation requires a robot to search for an unobserved target in an unknown environment by deciding where to explore next under partial observability. Effective search resembles human-like exploration: selectively probing visually promising frontiers while relying on spatial memory to avoid redundant revisits. We propose IntentNav, a spatial-visual imitation framework that learns human-like ObjectNav policies from human demonstrations.

arXiv CS 1d ago

HLL: Can Agents Cross Humanity's Last Line of Verification?

arXiv:2606.02449v1 Announce Type: new Abstract: Multimodal agents are increasingly expected to operate interfaces on behalf of users, raising a central deployment question: can they truly substitute for humans in workflows that services deliberately protect against automation? CAPTCHA verification makes this question concrete. It is not merely a visual puzzle, but a human-verification boundary placed before account creation, content access, form submission, and other protected actions.

arXiv CS 8d ago

ChatUMM: Robust Context Tracking for Conversational Interleaved Generation

arXiv:2602.06442v2 Announce Type: replace Abstract: Unified multimodal models (UMMs) have achieved remarkable progress yet remain constrained by a single-turn interaction paradigm, effectively functioning as solvers for independent requests rather than assistants in continuous dialogue. To bridge this gap, we present ChatUMM. As a conversational unified model, it excels at robust context tracking to sustain interleaved multimodal generation.

arXiv CS 8d ago

By the numbers: 100 days of the US-Israel war on Iran

From the human cost to the economy, Al Jazeera visualises how the US-Israel war on Iran has unfolded since February 28. Sunday marks 100 days into a war that US President Donald Trump said was going to finish “very fast”. Despite a ceasefire agreed on April 8, the Strait of Hormuz remains largely closed, sporadic fire continues, and talks have repeatedly collapsed.

Al Jazeera 3d ago

TraRA: Trajectory-level Recognition Aggregation for Video Text Spotting in Urban Surveillance

arXiv:2606.07161v1 Announce Type: new Abstract: Video Text Spotting (VTS) is essential for urban surveillance and intelligent transportation systems, enabling automated reading of street signs, vehicle markings, and scene text in video streams. However, reliable recognition remains challenging due to dynamic video factors common in surveillance scenarios, including motion blur, occlusion, and scale variation, which degrade frame-level recognition. Existing VTS methods typically perform...

arXiv CS 2d ago

WorldMemArena: Evaluating Multimodal Agent Memory Through Action-World Interaction

Announce Type: replace Abstract: Multimodal large language models are increasingly deployed as long-horizon agents, where memory must do more than recall: it must track an evolving world, revise what has gone stale, and surface the right evidence at decision time. Existing benchmarks measure recall over static dialogue, collapse memory into a single end-of-task accuracy, and reduce visual observations to captions, leaving us unable to localize failures to writing, maintenance, retrieval, or...

arXiv CS 8d ago