{"id":1045248,"date":"2024-06-25T17:56:09","date_gmt":"2024-06-26T00:56:09","guid":{"rendered":"https:\/\/www.microsoft.com\/en-us\/research\/?post_type=msr-project&p=1045248"},"modified":"2024-08-06T14:11:26","modified_gmt":"2024-08-06T21:11:26","slug":"e2-tts","status":"publish","type":"msr-project","link":"https:\/\/www.microsoft.com\/en-us\/research\/project\/e2-tts\/","title":{"rendered":"E2 TTS"},"content":{"rendered":"
\n\t
\n\t\t
\n\t\t\t\t\t<\/div>\n\t\t\n\t\t
\n\t\t\t\n\t\t\t
\n\t\t\t\t\n\t\t\t\t
\n\t\t\t\t\t\n\t\t\t\t\t
\n\t\t\t\t\t\t
\n\t\t\t\t\t\t\t\n\t\t\t\t\t\t\t\n\n

E2 TTS<\/h1>\n\n\n\n

Embarrassingly Easy Fully Non-Autoregressive Zero-Shot TTS<\/p>\n\n\t\t\t\t\t\t<\/div>\n\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t<\/div>\n\t\t<\/div>\n\t<\/div>\n<\/section>\n\n\n\n\n\n

E2 TTS (Embarrassingly Easy TTS)<\/strong> is a fully non-autoregressive zero-shot text-to-speech (TTS) system capable of generating the voice of any speaker. Despite its extremely simple<\/strong> model architecture and training scheme, E2 TTS achieves human-level naturalness and state-of-the-art speaker similarity and intelligibility<\/strong>.<\/p>\n\n\n\n

\n
Read the paper<\/a><\/div>\n<\/div>\n\n\n\n

State-of-the-art zero-shot TTS with a simple architecture<\/h2>\n\n\n\n

E2 TTS consists of only two modules: a flow-matching Transformer and a vocoder. The input is a sequence of characters with filler tokens. It does not include additional components such as a duration model or a grapheme-to-phoneme converter, nor does it rely on complex techniques such as monotonic alignment search or cross-attention in a specific architecture.<\/p>\n\n\n\n
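The character-plus-filler input described above can be illustrated with a minimal sketch. It assumes that filler tokens are appended to the character sequence until it reaches a target length, such as the number of output audio frames. The token symbol '[FILL]', the function name pad_with_filler, and the num_frames parameter are all hypothetical; this page does not specify them, and the sketch is not the authors' implementation.

```python
from typing import List

# Hypothetical filler symbol; the page does not name the actual token.
FILLER = '[FILL]'


def pad_with_filler(text: str, num_frames: int, filler: str = FILLER) -> List[str]:
    '''Split text into characters and append filler tokens up to num_frames.

    Assumption: padding is appended at the end and the target length is
    the number of output frames.
    '''
    chars = list(text)
    if len(chars) > num_frames:
        raise ValueError('text is longer than the target frame count')
    return chars + [filler] * (num_frames - len(chars))


if __name__ == '__main__':
    # Pad an 8-character string to a hypothetical 12-frame target.
    print(pad_with_filler('hi there', 12))
```

Under this assumption the text input needs no phoneme conversion or explicit duration prediction, which is consistent with the absence of a grapheme-to-phoneme converter and a duration model noted above.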

\n
\n
\"E2<\/figure>\n<\/div>\n\n\n\n
\n

E2 TTS is a zero-shot TTS system that can generate the voice of any speaker from a short audio sample, known as an audio prompt.<\/strong><\/p>\n\n\n\n

\n
\n

Audio prompt<\/p>\n\n\n\n

\n