A Camera-Native Talking-Head Video Dataset for Various Computer Vision Tasks

arXiv CS, Tuesday 09 June 2026, 04:00 UTC. By Babak Naderi, Ross Cutler, Nabakumar Singh Khongbantabam

Key Points

arXiv:2603.26763v2

Abstract: Talking-head videos constitute a predominant content type in real-time communication, yet publicly available datasets for video processing research in this domain remain scarce and limited in signal fidelity. In this paper, we open-source a camera-native dataset of 847 talking-head recordings (approximately 212 minutes), each 15 s in duration, captured from 805 participants using 446 unique consumer webcam devices in their natural environments. All recordings are stored using the FFV1 lossless codec, preserving the camera-native signal -- uncompressed (24.4%) or MJPEG-encoded (75.6%) -- without additional lossy processing. Each recording is annotated with a Mean Opinion Score (MOS) and ten perceptual quality tokens that jointly explain 64.4% of the MOS variance. From this corpus, we curate a stratified benchmarking subset of 120 clips in three content conditions: original, background blur, and background replacement. Codec efficiency evaluation across four datasets and four codecs, namely H.264, H.265, H.266, and AV1, yields VMAF BD-rate savings up to $-71.3\%$ (H.266) relative to H.264, with significant encoder$\times$dataset ($\eta_p^2 = .112$) and encoder$\times$content condition ($\eta_p^2 = .149$) interactions, demonstrating that both content type and background processing affect compression efficiency. A preliminary super-resolution evaluation with four SR models confirms that the dataset significantly affects absolute performance while preserving model rankings, demonstrating applicability beyond codec benchmarking. The dataset offers 5$\times$ the scale of the largest prior talking-head webcam dataset (847 vs. 160 clips) with lossless signal fidelity, establishing a resource for benchmarking video compression, super-resolution, quality assessment, and enhancement models in real-time communication.
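
For readers unfamiliar with the BD-rate figures quoted in the abstract, the following is a minimal sketch of how such a number can be computed from two rate-quality curves. It is not the authors' evaluation code. The standard Bjøntegaard method fits a cubic polynomial to log-bitrate as a function of quality; this sketch swaps in piecewise-linear interpolation so that it needs only the Python standard library. The function names (bd_rate, _avg_log_rate) and the sample (bitrate, VMAF) points are hypothetical and are not taken from the paper.

```python
import math


def _avg_log_rate(points, q_lo, q_hi):
    """Average log-bitrate over [q_lo, q_hi].

    Uses piecewise-linear interpolation of log(rate) as a function of
    quality (a simplification of the usual cubic fit). Assumes every
    point on a curve has a distinct quality value.
    """
    pts = sorted((q, math.log(r)) for r, q in points)

    def interp(q):
        for (q0, y0), (q1, y1) in zip(pts, pts[1:]):
            if q0 <= q <= q1:
                t = (q - q0) / (q1 - q0)
                return y0 + t * (y1 - y0)
        raise ValueError("quality outside the measured range")

    # Integration knots: interval ends plus every measured quality inside,
    # so the trapezoid rule is exact for the piecewise-linear curve.
    knots = sorted({q_lo, q_hi, *(q for q, _ in pts if q_lo < q < q_hi)})
    area = sum(
        (b - a) * (interp(a) + interp(b)) / 2
        for a, b in zip(knots, knots[1:])
    )
    return area / (q_hi - q_lo)


def bd_rate(anchor, test):
    """Approximate BD-rate (%) of `test` relative to `anchor`.

    Each argument is a list of (bitrate, quality) pairs. A negative
    result means `test` needs less bitrate for the same quality.
    """
    q_lo = max(min(q for _, q in anchor), min(q for _, q in test))
    q_hi = min(max(q for _, q in anchor), max(q for _, q in test))
    if q_hi <= q_lo:
        raise ValueError("curves do not overlap in quality")
    diff = _avg_log_rate(test, q_lo, q_hi) - _avg_log_rate(anchor, q_lo, q_hi)
    return (math.exp(diff) - 1.0) * 100.0


# Hypothetical (bitrate in kbps, VMAF) points, for illustration only.
anchor_curve = [(500, 70.0), (1000, 80.0), (2000, 88.0), (4000, 93.0)]
# A made-up encoder that reaches the same quality at half the bitrate.
test_curve = [(r / 2, q) for r, q in anchor_curve]

print(round(bd_rate(anchor_curve, test_curve), 1))  # -50.0
```

Averaging the difference in log-bitrate, rather than in raw bitrate, is what makes the result a relative saving: a test curve that needs half the bitrate at every quality level gives exactly -50%, regardless of the absolute bitrates involved.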

Originally published by arXiv CS.
