Audio

Related Articles from SNS

Audio-Oscar: A Multi-Agent System for Complex Audio Scene Generation, Orchestration, and Refinement

arXiv:2606.07397v1 Announce Type: new Abstract: In recent years, audio generation has made significant progress in tasks such as text-to-speech (TTS), text-to-audio (TTA) and text-to-music (TTM). However, generating long-form and controllable audio from complex audio scene descriptions remains a significant challenge, as such scenes often require coordinated speech, sound effects, music, songs, temporal structure, and post-production. In this work, we introduce Audio-Oscar, a...

arXiv CS 2d ago

Audio-FLAN: An Instruction-Following Dataset for Unified Audio Understanding and Generation of Speech, Music, and Sound

Announce Type: replace Abstract: Recent advancements in audio tokenization have significantly enhanced the integration of audio capabilities into large language models (LLMs). However, audio understanding and generation are often treated as distinct tasks, hindering the development of truly unified audio-language models. While instruction tuning has demonstrated remarkable success in improving generalization and zero-shot learning across text and vision, its application to audio remains...

arXiv CS 1d ago

Omni-Embed-Audio: Leveraging Multimodal LLMs for Robust Audio-Text Retrieval

arXiv:2604.18360v2 Announce Type: replace Abstract: Audio-text retrieval systems based on Contrastive Language-Audio Pretraining (CLAP) achieve strong performance on traditional benchmarks; however, these benchmarks rely on caption-style queries that differ substantially from real-world search behavior, limiting their assessment of practical retrieval robustness. We present Omni-Embed-Audio (OEA), a retrieval-oriented encoder leveraging multimodal LLMs with native audio understanding. To...

arXiv CS 8d ago

Universal Audio Volt 876 USB Audio Interface Review: Pro-Level Polish

In the fall of 2006, I decided emo was out and IDM was in. Fueled by the hope of becoming the next Four Tet or Aphex Twin, I marched into my local Guitar Center and purchased an audio interface to convert my guitar and vocals into ones and zeroes, then mangle them in Ableton Live. When I got home, I plugged a brand-new M-Audio Fast Track Pro into my Windows desktop and immediately hit a brick wall of audio driver configuration hell.

Wired 9d ago

Focus Then Listen: An Empirical Study of Plug-and-Play Audio Enhancer for Noise-Robust Large Audio Language Models

arXiv:2603.04862v4 Announce Type: replace Abstract: Large audio language models (LALMs) are a class of foundation models for audio understanding. Existing LALMs tend to degrade significantly in real-world noisy acoustic conditions where speech and non-speech sounds interfere. While noise-aware fine-tuning can improve robustness, it requires task-specific noisy data and expensive retraining, limiting scalability.

arXiv CS 1d ago

Audio Pirates: Black-box Audio Watermark Removal via Diffusion Priors

Announce Type: new Abstract: With the rise of AI-generated audio, watermarking has become widely used for detecting misuse and protecting intellectual property. However, adversaries may try to remove these watermarks, making it critical to evaluate how well watermarking schemes withstand removal attacks. Existing attacks are often impractical: they either noticeably degrade perceptual quality or require access to the watermarking scheme.

arXiv CS 9d ago

MOSS-Audio Technical Report

arXiv:2606.01802v2 Announce Type: replace Abstract: MOSS-Audio is a unified audio-language model for speech, environmental sound, and music understanding, supporting audio captioning, time-aware question answering, timestamped transcription, and audio-grounded reasoning. MOSS-Audio couples a dedicated audio encoder with a modality adapter and a large language model: the encoder produces 12.5 Hz temporal representations, the adapter projects them into the decoder space, and the decoder...

arXiv CS 7d ago

Beyond Text Following: Repairable Arbitration Reversals in Audio-Language Models

arXiv:2606.05161v1 Announce Type: new Abstract: Audio-language models (ALMs) often follow text that conflicts with audio, even when the audio evidence is clear. This raises a basic question: is the audio-supported answer unavailable, or is it represented but overridden by the conflicting text? We examine this question using a same-audio counterfactual that keeps the audio fixed, removes only the conflicting text, and measures the resulting shift in model preference.

arXiv CS 6d ago