Autoregressive Speech Synthesis without Vector Quantization

Lingwei Meng; Long Zhou; Shujie Liu; Sanyuan Chen; Bing Han; Shujie Hu; Yanqing Liu; Jinyu Li; Sheng Zhao; Xixin Wu; Helen Meng; Furu Wei

July 2024

arXiv

We present MELLE, a novel continuous-valued token based language modeling approach for text-to-speech synthesis (TTS). MELLE autoregressively generates continuous mel-spectrogram frames directly from the text condition, bypassing the need for vector quantization, which was originally designed for audio compression and sacrifices fidelity compared to mel-spectrograms. Specifically, (i) instead of cross-entropy loss, we apply regression loss with a proposed spectrogram flux loss function to model the probability distribution of the continuous-valued tokens; (ii) we incorporate variational inference into MELLE to facilitate sampling mechanisms, thereby enhancing output diversity and model robustness. Experiments demonstrate that, compared to the two-stage codec language models VALL-E and its variants, the single-stage MELLE mitigates robustness issues by avoiding the inherent flaws of sampling discrete codes, achieves superior performance across multiple metrics, and, most importantly, offers a more streamlined paradigm.
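
The abstract names the training ingredients without defining them, so the following is only a rough sketch of how they might fit together, not the authors' implementation. It assumes the model predicts a per-frame mean and log-variance, draws a sample with the standard reparameterization trick, and is trained with a regression loss plus a KL term. The `flux_loss` function is a hypothetical reading of "spectrogram flux loss" (a term that rewards frame-to-frame change); its exact form, and all function names and shapes here, are assumptions.

```python
import numpy as np

def sample_latent(mu, log_var, rng):
    """Reparameterization trick: z = mu + sigma * eps, with eps ~ N(0, I)."""
    eps = rng.standard_normal(mu.shape)
    return mu + np.exp(0.5 * log_var) * eps

def regression_loss(pred, target):
    """L1 + L2 regression loss, averaged over all frames and mel bins."""
    diff = pred - target
    return np.mean(np.abs(diff)) + np.mean(diff ** 2)

def flux_loss(mu, target):
    """Hypothetical flux term (assumed form, not taken from the abstract):
    negative mean L1 distance between the predicted mean at frame t and the
    ground-truth frame at t-1, so larger frame-to-frame change lowers the loss."""
    return -np.mean(np.abs(mu[1:] - target[:-1]))

def kl_to_standard_normal(mu, log_var):
    """KL(N(mu, sigma^2) || N(0, I)), summed over mel bins, averaged over frames."""
    return 0.5 * np.mean(np.sum(mu ** 2 + np.exp(log_var) - 1.0 - log_var, axis=-1))

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    frames, mel_bins = 6, 80                           # illustrative sizes only
    target = rng.standard_normal((frames, mel_bins))   # stand-in mel-spectrogram
    mu = target + 0.1 * rng.standard_normal((frames, mel_bins))
    log_var = np.full((frames, mel_bins), -2.0)

    z = sample_latent(mu, log_var, rng)                # sampled continuous frame
    # Loss weights below are placeholders, not values from the paper.
    total = (regression_loss(z, target)
             + 0.5 * flux_loss(mu, target)
             + 0.1 * kl_to_standard_normal(mu, log_var))
    print(f"total loss: {total:.4f}")
```

Because the output is a continuous frame rather than a discrete code index, diversity here comes from the Gaussian noise in `sample_latent` rather than from top-k or nucleus sampling over a codebook.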