Home › Knowledge Base › Benchmarking Speech

Benchmarking Speech

No mentions found

This entity hasn't been tracked yet, or Iris is still building its knowledge base.

Related Articles from SNS

Benchmarking Speech-to-Speech Translation Models

arXiv:2606.03241v1 Announce Type: new Abstract: Speech-to-speech translation (S2ST) has advanced rapidly, but offline evaluation lacks a unified protocol: studies report non-overlapping metric subsets, preventing direct comparisons. We introduce COMPASS, a unified and reproducible benchmarking framework integrating 46 metrics across eight dimensions, and deploy it on 1,248 model-language configurations from FLEURS and CVSS, spanning cascaded and end-to-end architectures over ten language...

arXiv CS 7d ago

Lost in Speech: Benchmarking, Evaluation, and Parsing of Spoken Bilingual Conversational Language Beyond Standard UD Assumptions

arXiv:2602.06307v2 Announce Type: replace Abstract: Spoken bilingual conversations pose substantial challenges for syntactic parsing because they often include disfluencies and discourse-driven structures that complicate dependency parsing under standard Universal Dependencies (UD) assumptions and evaluation practices. To systematically study these challenges, in this work, we first introduce a linguistically grounded taxonomy of conversational bilingual phenomena, together with SpokeBench,...

arXiv CS 1d ago

SVHalluc: Benchmarking Speech-Vision Hallucination in Audio-Visual Large Language Models

arXiv:2606.02642v1 Announce Type: cross Abstract: Despite the success of audio-visual large-language models (LLMs), they can produce plausible but ungrounded outputs, termed hallucination. Existing benchmarks focus on environmental sounds (e.g., dog barking) to indicate event occurrence. In contrast, human speech carries fundamentally different, rich semantics and temporal structures, yet it remains unexplored whether current models can accurately align speech content with corresponding...

arXiv CS 7d ago

SpeechEditBench: A Bilingual Multi-Attribute Benchmark for Instruction-Guided Speech Editing

Announce Type: replace-cross Abstract: Instruction-guided speech editing requires a model to modify specified speech attributes while preserving unrelated characteristics. Despite rapid progress in Speech Large Language Models (Speech LLMs), systematic evaluation of this capability remains challenging, as existing benchmarks are fragmented across isolated editing tasks. To bridge this gap, we introduce SpeechEditBench, a bilingual multi-attribute benchmark for instruction-guided speech editing.

arXiv CS 6d ago

SpeechEditBench: A Bilingual Multi-Attribute Benchmark for Instruction-Guided Speech Editing

arXiv:2606.01804v1 Announce Type: cross Abstract: Instruction-guided speech editing requires a model to modify specified speech attributes while preserving unrelated characteristics. Despite rapid progress in Speech Large Language Models (Speech LLMs), systematic evaluation of this capability remains challenging, as existing benchmarks are fragmented across isolated editing tasks. To bridge this gap, we introduce \textbf{SpeechEditBench}, a bilingual multi-attribute benchmark for...

arXiv CS 8d ago

AUDDT: A Unified Benchmark Toolkit for Audio and Speech Deepfake Detectors

arXiv:2509.21597v2 Announce Type: replace-cross Abstract: With the prevalence of artificial intelligence (AI)-generated content, such as audio deepfakes, a large body of recent work has focused on developing deepfake detection techniques. However, existing benchmarks employ a narrow set of datasets, leaving detector generalization to real-world conditions uncertain. In this paper, we systematically review 31 existing audio deepfake datasets and present an open-source benchmarking toolkit...

arXiv CS 6d ago

MultiAPI Spoof: A Multi-API Dataset and Local-Attention Network for Speech Anti-spoofing Detection

arXiv:2512.07352v4 Announce Type: replace Abstract: Existing speech anti-spoofing benchmarks rely on a narrow set of public models, creating a substantial gap from real-world scenarios in which commercial systems employ diverse, often proprietary APIs. To address this issue, we introduce MultiAPI Spoof, a multi-API audio anti-spoofing dataset comprising about 230 hours of synthetic speech generated by 30 distinct APIs, including commercial services, open-source models, and online platforms....

arXiv CS 1d ago

Benchmarking AI for low-resource contexts: Thinking beyond leaderboards

arXiv:2605.28508v2 Announce Type: replace Abstract: Existing AI evaluation practices often fail to capture how systems actually perform in low-resource environments, where operational constraints shape usability as much as model quality. Through a structured analysis of existing benchmark families across speech, chat/RAG, and vision systems, we identify critical gaps between laboratory evaluation practices and real-world deployment conditions in low-resource environments. We argue that the...

arXiv CS 8d ago

The Lipreading Gap: Do VSR Models Perceive Visual Speech Like Human Lipreaders?

Announce Type: replace Abstract: Visual speech recognition (VSR) models now surpass human lipreaders on benchmarks, but do such gains establish human-like visual speech perception? To explore this, we compare three VSR systems with human baselines on the MaFI word-level lipreading dataset using word, character, phoneme, and viseme-level metrics. Although models achieve higher overall accuracy, they succeed and fail on different words than humans.

arXiv CS 1d ago

The Lipreading Gap: Do VSR Models Perceive Visual Speech Like Human Lipreaders?

arXiv:2606.07435v1 Announce Type: new Abstract: Visual speech recognition (VSR) models now surpass human lipreaders on benchmarks, but do such gains establish human-like visual speech perception? To explore this, we compare three VSR systems with human baselines on the MaFI word-level lipreading dataset using word, character, phoneme, and viseme-level metrics. Although models achieve higher overall accuracy, they succeed and fail on different words than humans.

arXiv CS 2d ago