{"id":947154,"date":"2023-06-09T01:13:41","date_gmt":"2023-06-09T08:13:41","guid":{"rendered":"https:\/\/www.microsoft.com\/en-us\/research\/?post_type=msr-project&p=947154"},"modified":"2024-08-27T05:44:33","modified_gmt":"2024-08-27T12:44:33","slug":"vall-e-x","status":"publish","type":"msr-project","link":"https:\/\/www.microsoft.com\/en-us\/research\/project\/vall-e-x\/","title":{"rendered":"VALL-E"},"content":{"rendered":"
VALL-E<\/h1>

A neural codec language model for speech synthesis<\/p>

We introduce a language modeling approach for text-to-speech synthesis (TTS). Specifically, we train a neural codec language model (called VALL-E<\/strong>) using discrete codes derived from an off-the-shelf neural audio codec model, and regard TTS as a conditional language modeling task rather than continuous signal regression as in previous work. VALL-E exhibits in-context learning capabilities and can synthesize high-quality personalized speech with only a 3-second enrolled recording of an unseen speaker as a prompt. VALL-E significantly outperforms the state-of-the-art zero-shot TTS system in speech naturalness and speaker similarity. In addition, VALL-E can preserve the speaker's emotion and the acoustic environment of the acoustic prompt in synthesis. Extending these capabilities, VALL-E X<\/strong> adapts to multi-lingual scenarios, enabling cross-lingual zero-shot TTS. VALL-E R<\/strong> introduces a phoneme monotonic alignment strategy that strengthens the robustness of speech generation. With repetition-aware sampling and grouped code modeling, VALL-E 2<\/strong> achieves human parity in zero-shot TTS performance on the LibriSpeech and VCTK datasets, the first reported instance of such a result. MELLE<\/strong> is a continuous-valued token-based language modeling approach for TTS: it autoregressively generates continuous mel-spectrogram frames directly from the text condition, bypassing vector quantization, which was originally designed for audio compression and sacrifices fidelity compared to mel-spectrograms.<\/p>
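The repetition-aware sampling used by VALL-E 2 can be sketched roughly as follows. This is an illustrative toy in Python, not the released implementation: the idea is to decode with nucleus (top-p) sampling by default, but fall back to sampling from the full distribution when the chosen token has repeated too often in the recent history. The `window` and `ratio` thresholds and all function names are assumptions.

```python
# Illustrative sketch of repetition-aware sampling (VALL-E 2 style).
# Helper names and thresholds are assumptions, not the actual API.
import random

def nucleus_sample(probs, rng, top_p=0.9):
    """Sample from the smallest set of tokens whose mass reaches top_p."""
    ranked = sorted(range(len(probs)), key=lambda i: -probs[i])
    kept, total = [], 0.0
    for i in ranked:
        kept.append(i)
        total += probs[i]
        if total >= top_p:
            break
    mass = sum(probs[i] for i in kept)
    r, acc = rng.random() * mass, 0.0
    for i in kept:
        acc += probs[i]
        if r <= acc:
            return i
    return kept[-1]

def repetition_aware_sample(probs, history, rng, window=10, ratio=0.5):
    """Nucleus-sample a token; if it dominates the recent decoding history,
    resample from the full distribution to break the repetition loop."""
    token = nucleus_sample(probs, rng)
    recent = history[-window:]
    if recent and recent.count(token) / len(recent) > ratio:
        token = rng.choices(range(len(probs)), weights=probs)[0]
    return token

rng = random.Random(0)
# History stuck on token 0 triggers the full-distribution fallback.
tok = repetition_aware_sample([0.9, 0.05, 0.03, 0.02], [0] * 10, rng)
print(tok)
```

The fallback keeps the decoder from collapsing into the silence/stutter loops that plain nucleus sampling can produce on long codec-code sequences.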

Model versions<\/h2>\n\n\n\n\n\n\n\n\n\n\n
[Figures: architecture diagrams for the VALL-E model versions]

Ethics statement<\/h2>\n\n\n\n

VALL-E can synthesize speech that maintains speaker identity and could be used for education, entertainment, journalism, self-authored content, accessibility features, interactive voice response systems, translation, chatbots, and so on. While VALL-E can speak in a voice resembling the voice talent's, the similarity and naturalness depend on the length and quality of the speech prompt, the background noise, and other factors. The model carries potential risks of misuse, such as spoofing voice identification or impersonating a specific speaker. We conducted the experiments under the assumption that the user agrees to be the target speaker in speech synthesis. If the model is generalized to unseen speakers in the real world, it should include a protocol to ensure that the speaker approves the use of their voice, as well as a synthesized-speech detection model. If you suspect that VALL-E is being used in a manner that is abusive or illegal or infringes on your rights or the rights of other people, you can report it at the Report Abuse Portal.<\/p>

VALL-E is a language modeling approach for text-to-speech synthesis (TTS). Specifically, we train a neural codec language model (called VALL-E) using discrete codes derived from an off-the-shelf neural audio codec model, and regard TTS as a conditional language modeling task rather than continuous signal regression as in previous work. During the pre-training stage, we scale up the TTS training data to 60K hours of English speech, hundreds of times larger than existing systems. VALL-E exhibits in-context learning capabilities and can synthesize high-quality personalized speech with only a 3-second enrolled recording of an unseen speaker<\/strong> as an acoustic prompt. Experimental results show that VALL-E significantly outperforms the state-of-the-art zero-shot TTS system in speech naturalness and speaker similarity. In addition, we find that VALL-E can preserve the speaker's emotion and the acoustic environment of the acoustic prompt in synthesis.<\/p>

This page is for research demonstration purposes<\/strong> only.<\/p>\n\n\n\n

\n

Model Overview<\/h2>\n\n\n\n

Unlike the previous pipeline (e.g., phoneme \u2192 mel-spectrogram \u2192 waveform), the pipeline of VALL-E is phoneme \u2192 discrete code \u2192 waveform. VALL-E generates the discrete audio codec codes based on phoneme and acoustic code prompts, corresponding to the target content and the speaker\u2019s voice. VALL-E directly enables various speech synthesis applications, such as zero-shot TTS, speech editing, and content creation combined with other generative AI models like GPT.<\/p>\n<\/div>
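The phoneme \u2192 discrete code \u2192 waveform pipeline described above can be sketched as a toy in Python. `phonemize`, `CodecLM`, and `codec_decode` are hypothetical stand-ins for the real components (a grapheme-to-phoneme front end, the codec language model, and a neural codec decoder such as EnCodec); the random "model" only mimics the shapes involved, not any learned behavior.

```python
# Hypothetical sketch of the VALL-E inference pipeline:
# phoneme -> discrete codec code -> waveform. All names are illustrative
# stand-ins, not the actual VALL-E API.
import random

CODEBOOK_SIZE = 1024  # EnCodec-style codebooks hold 1024 entries per quantizer

def phonemize(text):
    """Toy front end: treats each letter as a 'phoneme' id."""
    return [ord(c) % 100 for c in text.lower() if c.isalpha()]

class CodecLM:
    """Stand-in for the neural codec language model: conditioned on phonemes
    and an acoustic prompt (codes from a 3-second enrolled recording), it
    predicts a sequence of discrete codec codes for the target utterance."""
    def __init__(self, seed=0):
        self.rng = random.Random(seed)

    def generate(self, phonemes, prompt_codes, frames_per_phoneme=2):
        n_frames = len(phonemes) * frames_per_phoneme
        # A real model decodes autoregressively; here we emit random codes
        # of the right shape to show the data flow.
        return [self.rng.randrange(CODEBOOK_SIZE) for _ in range(n_frames)]

def codec_decode(codes, hop=320):
    """Stand-in for the codec decoder: maps codes to waveform samples
    (silence here), one hop of audio per code frame."""
    return [0.0] * (len(codes) * hop)

phonemes = phonemize("Hello world")
prompt_codes = [17, 512, 3, 999]          # codes from the enrolled recording
codes = CodecLM().generate(phonemes, prompt_codes)
waveform = codec_decode(codes)
print(len(codes), len(waveform))          # 20 6400
```

The key design point this illustrates: because the intermediate representation is a sequence of discrete codes rather than a mel-spectrogram, generation reduces to conditional language modeling, and the speaker's voice enters only through the prompt codes.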

\"VALL-E<\/figure><\/div>\n\n\n\n

Zero-shot TTS for the LibriSpeech and VCTK datasets<\/h2>

Sample transcripts are listed below; on the original page, each transcript is paired with Speaker Prompt, Ground Truth, Baseline, and VALL-E audio players, which are omitted here.

They moved thereafter cautiously about the hut groping before and about them to find something to show that Warrenton had fulfilled his mission.

And lay me down in thy cold bed and leave my shining lot.

Number ten, fresh nelly is waiting on you, good night husband.

Yea, his honourable worship is within, but he hath a godly minister or two with him, and likewise a leech.

Instead of shoes, the old man wore boots with turnover tops, and his blue coat had wide cuffs of gold braid.

The army found the people in poverty and left them in comparative wealth.

Thus did this humane and right minded father comfort his unhappy daughter, and her mother embracing her again, did all she could to soothe her feelings.

He was in deep converse with the clerk and entered the hall holding him by the arm.