Audio Spectrogram Transformers

Related Articles from SNS

AudioRWKV: Efficient and Stable Bidirectional RWKV for Audio Pattern Recognition

Abstract: Recently, Transformers (e.g., Audio Spectrogram Transformers, AST) and state-space models (e.g., Audio Mamba, AuM) have achieved remarkable progress in audio modeling. However, the O(L^2) computational complexity of the Transformer architecture hinders efficient long-sequence processing, while the Mamba architecture tends to become unstable when scaling parameters and data.

arXiv CS 1d ago

MeanVC 2: Robust Low-Latency Streaming Zero-Shot Voice Conversion

arXiv:2606.09050v1 Abstract: Streaming zero-shot voice conversion (VC) has become increasingly popular due to its potential for real-time applications. The recently proposed MeanVC achieves lightweight streaming zero-shot VC, but it has several limitations: its chunk-wise autoregressive denoising doubles the effective training sequence length, conversion quality degrades under small-chunk settings, and its timbre encoder directly relies on reference mel-spectrograms,...

arXiv CS 1d ago

WavTTS: Towards High-Quality Zero-Shot TTS via Direct Raw Waveform Modeling

arXiv:2606.03455v1 Abstract: Recently, diffusion models operating on VAE latents or mel-spectrograms have become the dominant paradigm for zero-shot TTS. Although these compressed representations improve generation efficiency, they inevitably suffer from information loss and non-end-to-end training. Theoretically, directly modeling raw waveforms circumvents these issues; however, this direction remains underexplored and is often deemed difficult due to the extremely long...

arXiv CS 7d ago

Feds unwittingly leak pilots' pre-crash conversation

The US National Transportation Safety Board (NTSB) released a spectrographic image derived from the cockpit audio of a UPS plane crash, despite a policy against releasing such recordings. Technically skilled individuals were able to reconstruct approximate audio from the image, prompting the NTSB to acknowledge the privacy breach. The board stated that federal law prohibits the public release of sensitive cockpit communications.

The Register 18d ago

C2GA: A Class-Controllable Generative Augmentation Framework for Respiratory Sound Classification

Abstract: Background: Respiratory sound classification plays a critical role in the clinical identification of pulmonary pathologies. However, its performance is often hindered by the limited size, severe noise, and class imbalance of real-world auscultation datasets. Although conventional audio augmentation techniques are easy to implement, they may inadvertently distort subtle pathological characteristics.

arXiv CS 8d ago

AeroSpectra Sentinel: An Auditable LLM Prompt-Chaining Decision-Support Workflow for Acute Asthma Risk Assessment from Respiratory Sounds and Clinical Signals

Abstract: Acute asthma risk assessment requires rapid interpretation of respiratory sounds, oxygenation, airflow limitation, speech ability, work of breathing, mental status, and response to reliever therapy. Conventional audio-only classifiers can detect wheeze-like patterns but often lack transparent clinical reasoning and safe escalation logic.

arXiv CS 1d ago