Audio Spectrogram Transformers
No mentions found
This entity hasn't been tracked yet, or Iris is still building its knowledge base.
Related Articles from SNS
AudioRWKV: Efficient and Stable Bidirectional RWKV for Audio Pattern Recognition
Announce Type: replace Abstract: Recently, Transformers (e.g., Audio Spectrogram Transformers, AST) and state-space models (e.g., Audio Mamba, AuM) have achieved remarkable progress in audio modeling. However, the O(L^2) computational complexity of the Transformer architecture hinders efficient long-sequence processing, while the Mamba architecture tends to become unstable when scaling parameters and data.
MeanVC 2: Robust Low-Latency Streaming Zero-Shot Voice Conversion
arXiv:2606.09050v1 Announce Type: cross Abstract: Streaming zero-shot voice conversion (VC) has become increasingly popular due to its potential for real-time applications. The recently proposed MeanVC achieves lightweight streaming zero-shot VC, but it has several limitations: its chunk-wise autoregressive denoising doubles the effective training sequence length, conversion quality degrades under small-chunk settings, and its timbre encoder directly relies on reference mel-spectrograms,...
WavTTS: Towards High-Quality Zero-Shot TTS via Direct Raw Waveform Modeling
arXiv:2606.03455v1 Announce Type: cross Abstract: Recently, diffusion models operating on VAE latents or mel-spectrograms have become the dominant paradigm for zero-shot TTS. Although these compressed representations improve generation efficiency, they inevitably suffer from information loss and non-end-to-end training. Theoretically, directly modeling raw waveforms circumvents these issues; however, this direction remains underexplored and is often deemed difficult due to the extremely long...
Feds unwittingly leak pilots' pre-crash conversation
The US National Transportation Safety Board (NTSB) released a spectrographic image derived from the cockpit audio of a UPS plane crash, despite a policy against releasing such recordings. Technically skilled individuals were able to reconstruct approximate audio from the image, prompting the NTSB to acknowledge the privacy breach. The board stated that federal law prohibits the public release of sensitive cockpit communications.
C2GA: A Class-Controllable Generative Augmentation Framework for Respiratory Sound Classification
Announce Type: new Abstract: Background: Respiratory sound classification plays a critical role in the clinical identification of pulmonary pathologies. However, its performance is often hindered by the limited size, severe noise, and class imbalance of real-world auscultation datasets. Although conventional audio augmentation techniques are easy to implement, they may inadvertently distort subtle pathological characteristics.
AeroSpectra Sentinel: An Auditable LLM Prompt-Chaining Decision-Support Workflow for Acute Asthma Risk Assessment from Respiratory Sounds and Clinical Signals
Announce Type: cross Abstract: Acute asthma risk assessment requires rapid interpretation of respiratory sounds, oxygenation, airflow limitation, speech ability, work of breathing, mental status, and response to reliever therapy. Conventional audio-only classifiers can detect wheeze-like patterns but often lack transparent clinical reasoning and safe escalation logic.