Spectrograms
No mentions found
This entity hasn't been tracked yet, or Iris is still building its knowledge base.
Related Articles from SNS
Parameter-efficient Dual-encoder Architecture with Differentiable Choquet Integral Fusion for Underwater Acoustic Classification
Announce Type: new Abstract: Underwater acoustic classification has a wide array of oceanic applications, but faces challenges due to an increasingly complex acoustic environment. Waveform and spectrogram representations have been primarily used as acoustic data features for classification tasks in this domain. Spectrograms model harmonic dependencies, but these reduced representations can filter out acoustic features relevant for discrimination.
Parameter-efficient Dual-encoder Architecture with Differentiable Choquet Integral Fusion for Underwater Acoustic Classification
arXiv:2606.02341v2 Announce Type: replace Abstract: Underwater acoustic classification has a wide array of oceanic applications, but faces challenges due to an increasingly complex acoustic environment. Waveform and spectrogram representations have been primarily used as acoustic data features for classification tasks in this domain. Spectrograms model harmonic dependencies, but these reduced representations can filter out acoustic features relevant for discrimination.
Systematic Evaluation of Time-Frequency Features for Binaural Sound Source Localization
Announce Type: replace-cross Abstract: This study presents a systematic evaluation of time-frequency feature design for binaural sound source localization (SSL), focusing on how feature selection influences model performance across diverse conditions. We investigate the performance of a convolutional neural network (CNN) model using various combinations of amplitude-based features (magnitude spectrogram, interaural level difference - ILD) and phase-based features (phase spectrogram,...
WavTTS: Towards High-Quality Zero-Shot TTS via Direct Raw Waveform Modeling
arXiv:2606.03455v1 Announce Type: cross Abstract: Recently, diffusion models operating on VAE latents or mel-spectrograms have become the dominant paradigm for zero-shot TTS. Although these compressed representations improve generation efficiency, they inevitably suffer from information loss and non-end-to-end training. Theoretically, directly modeling raw waveforms circumvents these issues; however, this direction remains underexplored and is often deemed difficult due to the extremely long...
MyGardenBird: A Machine-Learning-Ready Bird Sound Dataset for Twelve Common Malaysian Birds
Announce Type: new Abstract: Bioacoustic datasets from tropical regions remain limited, in part due to the absence of reproducible workflows for aggregating recordings from public archives. We present \textbf{MyGardenBird}, a curated dataset of bird vocalisations representing twelve common species across Peninsular Malaysia and the Indo-Malayan region. Recordings were sourced from Xeno-canto and processed through species-level filtering, manual spectrogram segmentation, and quality control...
MAEPose: Self-Supervised Spatiotemporal Learning for Human Pose Estimation on mmWave Video
arXiv:2605.00242v2 Announce Type: replace Abstract: Millimetre-wave (mmWave) radar offers a more privacy-preserving alternative to RGB-based human pose estimation. However, existing methods typically rely on pre-extracted intermediate representations such as sparse point clouds or spectrogram images, where the rich spatiotemporal information naturally present in radar video streams is discarded for model learning, while such signal processing adds system complexity. In addition, existing...
AudioRWKV: Efficient and Stable Bidirectional RWKV for Audio Pattern Recognition
Announce Type: replace Abstract: Recently, Transformers (e.g., Audio Spectrogram Transformers, AST) and state-space models (e.g., Audio Mamba, AuM) have achieved remarkable progress in audio modeling. However, the O(L^2) computational complexity of the Transformer architecture hinders efficient long-sequence processing, while the Mamba architecture tends to become unstable when scaling parameters and data.
MeanVC 2: Robust Low-Latency Streaming Zero-Shot Voice Conversion
arXiv:2606.09050v1 Announce Type: cross Abstract: Streaming zero-shot voice conversion (VC) has become increasingly popular due to its potential for real-time applications. The recently proposed MeanVC achieves lightweight streaming zero-shot VC, but it has several limitations: its chunk-wise autoregressive denoising doubles the effective training sequence length, conversion quality degrades under small-chunk settings, and its timbre encoder directly relies on reference mel-spectrograms,...
Multi-View Speech Representation Learning for Parkinson's Disease Detection Using Context-guided Cross-modal Attention
arXiv:2606.09271v1 Announce Type: new Abstract: Parkinson's disease (PD) is a progressive neurodegenerative disorder that frequently causes speech impairments associated with hypokinetic dysarthria. As speech production relies on the precise coordination of complex neuromuscular mechanisms, speech analysis has emerged as a promising non-invasive and cost-effective biomarker for early PD detection. Recent deep learning approaches have shown encouraging results; however, most existing methods...
Feds unwittingly leak pilots' pre-crash conversation
The US National Transportation Safety Board (NTSB) released a spectrographic image derived from the cockpit audio of a UPS plane crash, despite a policy against releasing such recordings. Technically skilled individuals were able to reconstruct approximate audio from the image, prompting the NTSB to acknowledge the privacy breach. The board stated that federal law prohibits the public release of sensitive cockpit communications.