TTS System

Related Articles from SNS

End-to-End Training for Discrete Token LLM based TTS System

Announce type: new. Abstract: Recent state-of-the-art (SOTA) text-to-speech (TTS) systems typically adopt a cascaded pipeline consisting of a speech tokenizer, an autoregressive large language model (LLM), and a diffusion-based flow-matching (FM) model, with these components trained independently. In this paper, we propose a fully end-to-end (E2E) optimization framework that unifies the training of the speech tokenizer, LLM, FM model, and an additional reward model (RM). Specifically, we first jointly...

arXiv CS 1d ago

FlashTTS: Fast Streaming TTS with MTP Acceleration and X-pred Mean Flow Distillation

arXiv:2606.09141v1. Announce type: cross. Abstract: Recent progress in speech dialogue systems requires Text-to-Speech (TTS) models to be faster and more responsive. Modern speech dialogue systems impose two primary requirements on TTS models: low latency and support for streaming inputs and outputs. However, most existing single-codebook LLM-based TTS methods rely on multi-stage pipelines that lack native streaming capabilities.

arXiv CS 1d ago

Task-Vector Arithmetic for Emotional Expressivity Control in Language-Model-Based Text-to-Speech

arXiv:2606.05367v1. Announce type: new. Abstract: We investigate whether task-vector arithmetic, successful for cross-speaker emotional intensity control in modular text-to-speech (TTS), transfers to large-scale TTS systems built on language-model backbones with in-context learning (LM-TTS). Through a systematic elimination study over four progressively narrower operands on Qwen3-TTS-12Hz-1.7B: model weights via LoRA fine-tuning, continuous codec embeddings, discrete codec tokens, and the...

arXiv CS 5d ago

UniVoice: Unifying Autoregressive ASR and Flow-Matching based TTS with Large Language Models

arXiv:2510.04593v3. Announce type: replace-cross. Abstract: Large language models (LLMs) have demonstrated promising performance in both automatic speech recognition (ASR) and text-to-speech (TTS) systems, gradually becoming the mainstream approach. However, most current approaches address these tasks separately rather than through a unified framework.

arXiv CS 1d ago

NüshuVoice: Reviving the Voice of Endangered Nüshu with Pitch-Aware Text-to-Speech

Announce type: new. Abstract: Nüshu is an endangered phonetic script historically used by women in Jiangyong County, southern Hunan, China. While existing computational studies of Nüshu mainly focus on textual digitization and visual recognition, the acoustic reconstruction of its authentic pronunciation remains largely unexplored. Building a Nüshu text-to-speech (TTS) system is particularly challenging because available recordings are extremely limited and mostly consist of isolated...

arXiv CS 1d ago

Sparse Autoencoders for Interpretable Emotion Control in Text-to-Speech

arXiv:2606.01479v1. Announce type: new. Abstract: Integrating large language models (LLMs) into text-to-speech (TTS) systems has improved speech expressiveness, yet interpretable emotional control remains challenging. Existing approaches primarily rely on external conditioning or global activation steering, offering limited insight into the internal representations underlying emotional control.

arXiv CS 8d ago

UniVoice: A Unified Model for Speech and Singing Voice Generation

arXiv:2606.05852v1. Announce type: new. Abstract: Text-to-speech (TTS) and singing voice synthesis (SVS) both aim to generate human vocal audio from symbolic inputs, but they impose different requirements on the generation process. Speech generation relies on flexible, language-driven prosody, whereas singing generation requires explicit melody control and accurate rhythmic alignment. This mismatch makes it challenging to train a single model that can generate both natural speech and...

arXiv CS 5d ago

UniVocal: Unified Speech-Singing Code-Switching Synthesis

arXiv:2606.01677v1. Announce type: new. Abstract: We propose UniVocal, a unified framework that implicitly infers vocal modes from text context to pioneer Speech-Singing Code-Switching (SCS) Synthesis, a task where transitions are autonomously driven by textual semantics, akin to seamless human language blending. Unlike single-mode generation or systems relying on switching-control tags, our proposed UniVocal implicitly infers vocal modes solely from text context. To achieve this, we employ a...

arXiv CS 8d ago

TLDR: Compressing Audio Tokens for Efficient Autoregressive Text-to-Speech

arXiv:2606.09019v1. Announce type: new. Abstract: Codec-based autoregressive (AR) speech language models have achieved strong text-to-speech (TTS) quality by modeling speech as sequences of discrete audio tokens with large pretrained backbones. However, this token-level formulation creates a structural efficiency bottleneck: speech-token sequences are much longer than text sequences, requiring the AR backbone to perform causal computation at every token position and maintain a KV cache that...

arXiv CS 1d ago

OpenBibleTTS: Large-Scale Speech Resources and TTS Models for Low-Resource Languages

arXiv:2606.09553v1. Announce type: new. Abstract: Recent advances in neural text-to-speech (TTS) and multilingual speech generation have substantially improved synthetic speech quality, yet these gains remain unevenly distributed across the world's languages. Existing models are still dominated by a small set of high-resource languages, while many studies of low-resource TTS are simulated on artificially downsampled high-resource corpora that do not reflect the orthographic variation and...

arXiv CS 1d ago