{"id":1045257,"date":"2024-06-27T20:19:16","date_gmt":"2024-06-28T03:19:16","guid":{"rendered":"https:\/\/www.microsoft.com\/en-us\/research\/?post_type=msr-project&p=1045257"},"modified":"2024-06-28T02:22:18","modified_gmt":"2024-06-28T09:22:18","slug":"emoctrl-tts","status":"publish","type":"msr-project","link":"https:\/\/www.microsoft.com\/en-us\/research\/project\/emoctrl-tts\/","title":{"rendered":"EmoCtrl-TTS"},"content":{"rendered":"
\n\t
\n\t\t
\n\t\t\t\t\t<\/div>\n\t\t\n\t\t
\n\t\t\t\n\t\t\t
\n\t\t\t\t\n\t\t\t\t
\n\t\t\t\t\t\n\t\t\t\t\t
\n\t\t\t\t\t\t
\n\t\t\t\t\t\t\t\n\t\t\t\t\t\t\t\n\n

EmoCtrl-TTS<\/h1>\n\n\n\n

Controlling Time-Varying Emotional States of Flow-Matching-Based Zero-Shot Text-to-Speech<\/p>\n\n\t\t\t\t\t\t<\/div>\n\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t<\/div>\n\t\t<\/div>\n\t<\/div>\n<\/section>\n\n\n\n\n\n

EmoCtrl-TTS<\/strong> is an emotion-controllable zero-shot TTS model that can generate highly emotional speech with non-verbal vocalizations such as laughter and crying<\/strong> for any speaker. EmoCtrl-TTS is purely a research project. Currently, we have no plans to incorporate EmoCtrl-TTS into a product or expand access to the public.<\/p>\n\n\n\n

\n
Read the paper<\/a><\/div>\n<\/div>\n\n\n\n

Controlling time-varying emotional states of zero-shot text-to-speech<\/h2>\n\n\n\n

EmoCtrl-TTS uses embeddings that represent emotion and non-verbal vocalizations to condition a flow-matching-based zero-shot TTS model. To generate high-quality emotional speech, EmoCtrl-TTS is trained on over 27,000 hours of expressive data curated using pseudo-labeling.<\/p>\n\n\n\n

\n
\n
\"Overview\"<\/figure>\n<\/div>\n\n\n\n
\n

EmoCtrl-TTS can generate speech in any speaker's voice, including non-verbal vocalizations such as laughter and crying.<\/strong><\/p>\n\n\n\n

Generated speech samples<\/p>\n\n\n\n

\n
\n
\n