Audio
Related Articles from SNS
Audio-Oscar: A Multi-Agent System for Complex Audio Scene Generation, Orchestration, and Refinement
arXiv:2606.07397v1 Announce Type: new Abstract: In recent years, audio generation has made significant progress in tasks such as text-to-speech (TTS), text-to-audio (TTA) and text-to-music (TTM). However, generating long-form and controllable audio from complex audio scene descriptions remains a significant challenge, as such scenes often require coordinated speech, sound effects, music, songs, temporal structure, and post-production. In this work, we introduce Audio-Oscar, a...
Audio-FLAN: An Instruction-Following Dataset for Unified Audio Understanding and Generation of Speech, Music, and Sound
Announce Type: replace Abstract: Recent advancements in audio tokenization have significantly enhanced the integration of audio capabilities into large language models (LLMs). However, audio understanding and generation are often treated as distinct tasks, hindering the development of truly unified audio-language models. While instruction tuning has demonstrated remarkable success in improving generalization and zero-shot learning across text and vision, its application to audio remains...
Omni-Embed-Audio: Leveraging Multimodal LLMs for Robust Audio-Text Retrieval
arXiv:2604.18360v2 Announce Type: replace Abstract: Audio-text retrieval systems based on Contrastive Language-Audio Pretraining (CLAP) achieve strong performance on traditional benchmarks; however, these benchmarks rely on caption-style queries that differ substantially from real-world search behavior, limiting their assessment of practical retrieval robustness. We present Omni-Embed-Audio (OEA), a retrieval-oriented encoder leveraging multimodal LLMs with native audio understanding. To...
Universal Audio Volt 876 USB Audio Interface Review: Pro-Level Polish
In the fall of 2006, I decided emo was out and IDM was in. Fueled by the hope of becoming the next Four Tet or Aphex Twin, I marched into my local Guitar Center and purchased an audio interface to convert my guitar and vocals into ones and zeroes, then mangle them in Ableton Live. When I got home, I plugged a brand-new M-Audio Fast Track Pro into my Windows desktop and immediately hit a brick wall of audio driver configuration hell.
Focus Then Listen: An Empirical Study of Plug-and-Play Audio Enhancer for Noise-Robust Large Audio Language Models
arXiv:2603.04862v4 Announce Type: replace Abstract: Large audio language models (LALMs) are a class of foundation models for audio understanding. Existing LALMs tend to degrade significantly in real-world noisy acoustic conditions where speech and non-speech sounds interfere. While noise-aware fine-tuning can improve robustness, it requires task-specific noisy data and expensive retraining, limiting scalability.
Audio Pirates: Black-box Audio Watermark Removal via Diffusion Priors
Announce Type: new Abstract: With the rise of AI-generated audio, watermarking has become widely used for detecting misuse and protecting intellectual property. However, adversaries may try to remove these watermarks, making it critical to evaluate how well watermarking schemes withstand removal attacks. Existing attacks are often impractical: they either noticeably degrade perceptual quality or require access to the watermarking scheme.
MOSS-Audio Technical Report
arXiv:2606.01802v2 Announce Type: replace Abstract: MOSS-Audio is a unified audio-language model for speech, environmental sound, and music understanding, supporting audio captioning, time-aware question answering, timestamped transcription, and audio-grounded reasoning. MOSS-Audio couples a dedicated audio encoder with a modality adapter and a large language model: the encoder produces 12.5 Hz temporal representations, the adapter projects them into the decoder space, and the decoder...
Beyond Text Following: Repairable Arbitration Reversals in Audio-Language Models
arXiv:2606.05161v1 Announce Type: new Abstract: Audio-language models (ALMs) often follow text that conflicts with audio, even when the audio evidence is clear. This raises a basic question: is the audio-supported answer unavailable, or is it represented but overridden by the conflicting text? We examine this question using a same-audio counterfactual that keeps the audio fixed, removes only the conflicting text, and measures the resulting shift in model preference.