Exploring the Scale and Diversity of Speech Anti-spoofing Datasets: Experiments and Analysis

arXiv CS, Tuesday 09 June 2026, 04:00 UTC. By Zhuolin Yi, Jun Xue, Yanzhen Ren, Yihuan Huang, Yi Chai, Daixian Li, Guanxiang Feng, Jiajun Liu

Key Points

arXiv:2606.08038v1. Abstract: The scale of speech anti-spoofing datasets has grown exponentially over the past decade, driven by the assumption that larger data leads to better performance. However, it remains unclear whether indiscriminate scaling commensurately improves model generalization. This study challenges the "scale-first" paradigm by decoupling the impacts of training data scale versus diversity. Through experiments on representative datasets, we report two key findings: (1) Larger is not always better. Expanding data scale excessively under fixed generation methods yields negligible returns and may even degrade cross-domain generalization due to overfitting. (2) Diversity outweighs scale. A smaller composite training set featuring diverse attacks significantly outperforms larger-scale datasets with limited diversity in cross-dataset evaluations. We conclude that future dataset construction should prioritize the diversity of generation methods over scale to effectively enhance model generalization.
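One way to picture what "decoupling scale from diversity" could look like in practice is to hold the number of training utterances fixed while varying how many generation methods they are drawn from. The sketch below is only an illustration under assumptions, not the authors' procedure: the metadata format (pairs of utterance ID and attack label), the attack labels `A1` to `A4`, and the function name `sample_training_set` are all hypothetical.

```python
import random
from collections import Counter, defaultdict


def sample_training_set(pool, attack_types, n_total, seed=0):
    """Draw n_total utterances spread as evenly as possible over attack_types.

    pool is a list of (utterance_id, attack_type) pairs; this metadata
    format is an assumption made for the example.
    """
    rng = random.Random(seed)

    # Group utterance IDs by the attack (generation method) that produced them.
    by_attack = defaultdict(list)
    for utt_id, attack in pool:
        by_attack[attack].append(utt_id)

    # Split the budget evenly; the first `extra` attacks get one more utterance.
    base, extra = divmod(n_total, len(attack_types))
    selected = []
    for i, attack in enumerate(attack_types):
        quota = base + (1 if i < extra else 0)
        candidates = by_attack[attack]
        if quota > len(candidates):
            raise ValueError(f"not enough utterances for attack {attack!r}")
        selected.extend((utt_id, attack) for utt_id in rng.sample(candidates, quota))
    return selected


# Toy pool: four hypothetical attack types with 100 utterances each.
pool = [(f"utt_{a}_{i}", a) for a in ("A1", "A2", "A3", "A4") for i in range(100)]

# Same scale (60 utterances), different diversity (1 vs. 4 attack types).
low_diversity = sample_training_set(pool, ["A1"], n_total=60)
high_diversity = sample_training_set(pool, ["A1", "A2", "A3", "A4"], n_total=60)

print(Counter(a for _, a in low_diversity))
print(Counter(a for _, a in high_diversity))
```

Training the same detector on each subset and comparing cross-dataset error would then isolate the effect of diversity at a fixed scale; repeating the draw with a larger `n_total` and a fixed attack list would isolate scale instead.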

Originally published by arXiv CS

