{"id":1083075,"date":"2024-09-10T03:33:40","date_gmt":"2024-09-10T10:33:40","guid":{"rendered":"https:\/\/www.microsoft.com\/en-us\/research\/?post_type=msr-blog-post&p=1083075"},"modified":"2024-09-10T22:22:55","modified_gmt":"2024-09-11T05:22:55","slug":"vall-e-2-enhancing-the-robustness-and-naturalness-of-text-to-speech-models","status":"publish","type":"msr-blog-post","link":"https:\/\/www.microsoft.com\/en-us\/research\/articles\/vall-e-2-enhancing-the-robustness-and-naturalness-of-text-to-speech-models\/","title":{"rendered":"VALL-E 2: Enhancing the robustness and naturalness of text-to-speech models"},"content":{"rendered":"\n
Author: Shujie Liu<\/a><\/em><\/p>\n\n\n\n <\/p>\n\n\n\n In recent years, the rapid advancement of AI has continually expanded the capabilities of Text-to-Speech (TTS) technology. Ongoing optimizations and innovations in TTS have enriched and simplified voice interaction experiences. These research developments hold significant potential across various fields, including education, entertainment, and multilingual communication, etc.<\/p>\n\n\n\n Traditional TTS systems, trained with high-quality clean data from the recording studio, still suffer from poor generalization. Speaker similarity and speech naturalness decrease dramatically for unseen speakers in the zero-shot scenario. To address this issue, researchers from MSR Asia developed VALL-E by introducing the LLM technique (training a model to predict the next token with large unsupervised sequential data) into the speech processing tasks. VALL-E is the first neural codec language model using discrete codes derived from an off-the-shelf neural audio codec model. It regards TTS as a conditional language model, emerging in-context learning capabilities. VALL-E is capable of synthesizing high-quality personalized speech with only a 3-second enrolled recording of an unseen speaker as a prompt. However, due to the auto-regressive modeling and the random sampling inference, VALL-E suffers from robustness and efficiency problems.<\/p>\n\n\n\n To deal with these problems, researchers proposed VALL-E 2, which leverages repetition-aware sampling and grouped code modeling techniques, achieving human parity in zero-shot TTS performance on LibriSpeech and VCTK datasets. Repetition aware sampling refines the original nucleus sampling process by accounting for token repetition in the decoding history. This approach not only stabilizes decoding but also circumvents the infinite loop issue observed in VALL-E. Additionally, grouped token modeling organizes codec codes into groups to effectively shorten the sequence length, which not only boosts inference speed but also addresses the challenges of long sequence modeling.<\/p>\n\n\n\n With these two techniques, VALL-E 2 surpasses previous systems in terms of speech robustness, naturalness, and speaker similarity. VALL-E 2 demonstrated constant ability to synthesize high-quality speech, even for sentences that are traditionally challenging due to their complexity or repetitive phrases.<\/p>\n\n\n\n VALL-E 2 paper<\/strong>: <\/strong>https:\/\/arxiv.org\/abs\/2406.05370 (opens in new tab)<\/span><\/a><\/p>\n\n\n\n VALL-E 2 demo<\/strong>: https:\/\/aka.ms\/valle2 (opens in new tab)<\/span><\/a><\/p>\n\n\n\n