the Temporal Understanding

Related Articles from SNS

From Segments to Scenes: Temporal Understanding in Autonomous Driving via Vision-Language Model

Abstract: Vision-Language Models (VLMs) are increasingly deployed as the perception and reasoning backbone of autonomous agents acting in the wild, with autonomous driving (AD) being one of the most safety-critical instances. Reliable temporal understanding is essential for such agents to anticipate events, attribute causes, and act safely in dynamic environments, yet this remains a significant challenge even for state-of-the-art (SoTA) VLMs. Prior video benchmarks...

arXiv CS 8d ago

From Segments to Scenes: Temporal Understanding for Agentic Autonomous Driving via Vision-Language Models

arXiv CS 6d ago

SMART: Shot-Aware Multimodal Video Moment Retrieval with Audio-Enhanced MLLM

arXiv:2511.14143v2. Abstract: Video Moment Retrieval is a task in video understanding that aims to localize a specific temporal segment in an untrimmed video based on a natural language query. Despite recent progress in moment retrieval from videos using both traditional techniques and Multimodal Large Language Models (MLLM), most existing methods still rely on coarse temporal understanding and a single visual modality, limiting performance on complex videos. To address...

arXiv CS 1d ago

TempoBench: Evaluating Temporal Causal Reasoning in Large Language Models

arXiv:2510.27544v2. Abstract: Temporal reasoning involves understanding how systems evolve over time through input-driven state transitions. A key aspect is temporal causal reasoning, causally reasoning about what prior inputs were necessary in causing an observed outcome. While large language models (LLMs) perform well at forward simulation, predicting outputs from inputs, they struggle to identify the minimal causal inputs of outcomes.

arXiv CS 1d ago

CoSTL: Comprehensive Spatial-Temporal Representation Learning for Moment Retrieval and Highlight Detection

arXiv:2606.01149v1. Abstract: Video Moment Retrieval (MR) and Highlight Detection (HD) are crucial tasks in video analysis that aim to localize specific moments and estimate clip-wise relevance based on a given text query. Recent approaches treat them as similar video grounding tasks and use the same architecture to solve them. These tasks require both fine-grained comprehension at the image level and high-level temporal understanding across the entire video.

arXiv CS 8d ago

VTI-CoT: Visual-Textual Interleaved Chain of Thought for Video Reasoning

Abstract: Video reasoning aims to understand complex temporal events and causal relationships within videos. Recently, Chain-of-Thought (CoT) has been introduced to this field to enhance reasoning accuracy. However, existing CoT-based video reasoning methods primarily rely on text-only information for logical deduction, overlooking critical visual information during the inference process.

arXiv CS 5d ago

CaptionFormer: Unified Segmentation, Tracking, and Captioning for Spatio-Temporal Objects

arXiv:2510.14904v3. Abstract: Dense Video Object Captioning (DVOC) is the task of jointly detecting, tracking, and captioning object trajectories in a video, requiring the ability to understand spatio-temporal details and describe them in natural language. Due to the complexity of the task and the high cost associated with manual annotation, previous approaches resort to training strategies with limited data, potentially leading to suboptimal performance. To circumvent...

arXiv CS 9d ago

CaptionFormer: Unified Segmentation, Tracking, and Captioning for Spatio-Temporal Objects

arXiv:2510.14904v4. Abstract: Dense Video Object Captioning (DVOC) is the task of jointly detecting, tracking, and captioning object trajectories in a video, requiring the ability to understand spatio-temporal details and describe them in natural language. Due to the complexity of the task and the high cost associated with manual annotation, previous approaches resort to training strategies with limited data, potentially leading to suboptimal performance. To circumvent...

arXiv CS 8d ago

Characterizing Online Criticism of Partisan News Media Using Weakly Supervised Learning

Abstract: We propose novel methods to identify tweets that criticize partisan news sources. Prior work suggests that criticism, ridicule, and distrust of news media all play important roles in hyperpartisanship, misinformation, and filter bubble formation. Thus, understanding the prevalence and temporal dynamics of media-targeted criticism can provide us with updated tools to assess the health of the information ecosystem.

arXiv CS 6d ago

Video-MTR: Reinforced Multi-Turn Reasoning for Long Video Understanding

Abstract: Long-form video understanding, characterized by long-range temporal dependencies and multiple events, remains a challenge. Existing methods often rely on static reasoning or external visual-language models (VLMs), which face issues like complexity and sub-optimal performance due to the lack of end-to-end training. In this paper, we propose Video-MTR, a reinforced multi-turn reasoning framework designed to enable iterative key video segment selection and...

arXiv CS 9d ago