Home › Knowledge Base › TTS

TTS

No mentions found

This entity hasn't been tracked yet, or Iris is still building its knowledge base.

Related Articles from SNS

FlashTTS: Fast Streaming TTS with MTP Acceleration and X-pred Mean Flow Distillation

arXiv:2606.09141v1 Announce Type: cross Abstract: Recent progress in speech dialogue systems requires Text-to-Speech (TTS) models to be faster and more responsive. Modern speech dialogue systems impose two primary requirements on TTS models: low latency and support for streaming inputs and outputs. However, most existing single-codebook LLM-based TTS methods rely on multi-stage pipelines that lack native streaming capabilities.

arXiv CS 1d ago

OpenBibleTTS: Large-Scale Speech Resources and TTS Models for Low-Resource Languages

arXiv:2606.09553v1 Announce Type: new Abstract: Recent advances in neural text-to-speech (TTS) and multilingual speech generation have substantially improved synthetic speech quality, yet these gains remain unevenly distributed across the world's languages. Existing models are still dominated by a small set of high-resource languages, while many studies of low-resource TTS are simulated on artificially downsampled high-resource corpora that do not reflect the orthographic variation and...

arXiv CS 1d ago

End-to-End Training for Discrete Token LLM based TTS System

new Abstract: Recent state-of-the-art (SOTA) text-to-speech (TTS) systems typically adopt a cascaded pipeline consisting of a speech tokenizer, an autoregressive large language model (LLM), and a diffusion based flow-matching (FM) model, with these components trained independently. In this paper, we propose a fully end-to-end (E2E) optimization framework that unifies the training of the speech tokenizer, LLM, FM model, and an additional reward model (RM). Specifically, we first jointly...

arXiv CS 1d ago

WavTTS: Towards High-Quality Zero-Shot TTS via Direct Raw Waveform Modeling

arXiv:2606.03455v1 Announce Type: cross Abstract: Recently, diffusion models operating on VAE latents or mel-spectrograms have become the dominant paradigm for zero-shot TTS. Although these compressed representations improve generation efficiency, they inevitably suffer from information loss and non-end-to-end training. Theoretically, directly modeling raw waveforms circumvents these issues; however, this direction remains underexplored and is often deemed difficult due to the extremely long...

arXiv CS 7d ago

UniVoice: Unifying Autoregressive ASR and Flow-Matching based TTS with Large Language Models

arXiv:2510.04593v3 Announce Type: replace-cross Abstract: Large language models (LLMs) have demonstrated promising performance in both automatic speech recognition (ASR) and text-to-speech (TTS) systems, gradually becoming the mainstream approach. However, most current approaches address these tasks separately rather than through a unified framework.

arXiv CS 1d ago

Chatterbox-Flash: Prior-Calibrated Block Diffusion for Streaming Zero-Shot TTS

Announce Type: new Abstract: We present Chatterbox-Flash, a zero-shot text-to-speech model obtained by fine-tuning a pretrained autoregressive TTS decoder into a block-diffusion decoder, enabling parallel token generation within each block while retaining block-by-block streaming. We find that naively transferring mainstream block-diffusion decoding to discrete speech tokens degrades quality, as a long-tail token distribution biases parallel position selection toward a few high-frequency...

arXiv CS 9d ago

Chatterbox-Flash: Prior-Calibrated Block Diffusion for Streaming Zero-Shot TTS

arXiv:2605.30748v2 Announce Type: replace Abstract: We present Chatterbox-Flash, a zero-shot text-to-speech model obtained by fine-tuning a pretrained autoregressive TTS decoder into a block-diffusion decoder, enabling parallel token generation within each block while retaining block-by-block streaming. We find that naively transferring mainstream block-diffusion decoding to discrete speech tokens degrades quality, as a long-tail token distribution biases parallel position selection toward a...

arXiv CS 8d ago

Task-Vector Arithmetic for Emotional Expressivity Control in Language-Model-Based Text-to-Speech

arXiv:2606.05367v1 Announce Type: new Abstract: We investigate whether task-vector arithmetic, successful for cross-speaker emotional intensity control in modular text-to-speech (TTS), transfers to large-scale TTS systems built on language-model backbones with in-context learning (LM-TTS). Through a systematic elimination study over four progressively narrower operands on Qwen3-TTS-12Hz-1.7B - model weights via LoRA fine-tuning, continuous codec embeddings, discrete codec tokens, and the...

arXiv CS 5d ago

Targeted Speaker Poisoning Framework in Zero-Shot Text-to-Speech

arXiv:2603.07551v2 Announce Type: replace Abstract: Zero-shot Text-to-Speech (TTS) voice cloning poses severe privacy risks, demanding the removal of specific speaker identities from trained TTS models. Conventional machine unlearning is insufficient in this context, as zero-shot TTS can dynamically reconstruct voices from just reference prompts. We formalize this task as Speech Generation Speaker Poisoning (SGSP), in which we modify trained models to prevent the generation of specific...

arXiv CS 9d ago

GLASS: GRPO-Trained LoRA for Acoustic Style Steering in Zero-Shot Text-to-Speech

Announce Type: new Abstract: We propose GLASS, a framework for composable acoustic style control in zero-shot autoregressive text-to-speech (TTS) that learns controls from post-generation rewards rather than style labels. In zero-shot TTS, a speaker prompt often entangles speaker identity with prosodic attributes such as speaking rate and pitch, making it difficult to change style without changing the prompt itself. GLASS instead treats each acoustic attribute as a reward-defined control...

arXiv CS 5d ago