EmoCtrl-TTS is an emotion-controllable zero-shot TTS that can generate highly emotional speech with non-verbal vocalizations such as laughter and crying for any speaker. EmoCtrl-TTS is purely a research project. Currently, we have no plans to incorporate EmoCtrl-TTS into a product or expand access to the public.
Controlling time-varying emotional states of zero-shot text-to-speech
EmoCtrl-TTS uses embeddings that represent emotion and non-verbal vocalizations to condition a flow-matching-based zero-shot TTS model. To generate high-quality emotional speech, EmoCtrl-TTS is trained on over 27,000 hours of expressive data, curated using pseudo-labeling.
EmoCtrl-TTS can generate speech in any speaker's voice, including non-verbal vocalizations such as laughter and crying.
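As a rough illustration of what conditioning a flow-matching sampler on frame-wise embeddings can look like, here is a toy sketch. It is not the EmoCtrl-TTS implementation: the function names, the embedding sizes, the Euler solver, and especially `toy_vector_field` (a hand-written stand-in for a trained network) are all assumptions made for this example.

```python
import numpy as np


def build_conditioning(emotion_emb, nv_emb):
    """Concatenate frame-wise emotion and non-verbal embeddings.

    Shapes: (T, De) and (T, Dn) -> (T, De + Dn). Hypothetical helper.
    """
    if emotion_emb.shape[0] != nv_emb.shape[0]:
        raise ValueError("embeddings must cover the same number of frames")
    return np.concatenate([emotion_emb, nv_emb], axis=1)


def euler_sample(vector_field, x0, cond, num_steps=16):
    """Integrate dx/dt = v(x, t, cond) from t=0 (noise) to t=1 with Euler steps."""
    x = x0.copy()
    dt = 1.0 / num_steps
    for step in range(num_steps):
        t = step * dt
        x = x + dt * vector_field(x, t, cond)
    return x


def toy_vector_field(x, t, cond):
    """Stand-in for a trained network (assumption, not the real model).

    It pushes every frame toward a target derived from that frame's
    conditioning vector, so the output changes over time with the condition.
    """
    target = cond.mean(axis=1, keepdims=True) * np.ones_like(x)
    return (target - x) / (1.0 - t)


rng = np.random.default_rng(0)
frames, feat_dim = 5, 4                    # toy sizes, chosen arbitrarily
emotion = rng.normal(size=(frames, 3))     # placeholder emotion embeddings
nonverbal = rng.normal(size=(frames, 2))   # placeholder non-verbal embeddings
cond = build_conditioning(emotion, nonverbal)
noise = rng.normal(size=(frames, feat_dim))
features = euler_sample(toy_vector_field, noise, cond)
```

In this toy setup each output frame ends up at the value implied by its own conditioning vector, which is the sense in which frame-wise conditioning allows the generated features to vary over time.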
Generated speech samples
EmoCtrl-TTS is specifically designed to capture the time-varying emotional states found in the audio prompt.
Audio prompt (Angry → Calm)
Generated speech by Voicebox (prior work)
Generated speech by EmoCtrl-TTS (our work)
Audio samples
Below, we include audio samples demonstrating how EmoCtrl-TTS performs. The speech samples were taken from the JVNV dataset, the DiariST-AliMeeting dataset, and the RAVDESS dataset. They are provided for the sole purpose of illustrating EmoCtrl-TTS.
Capturing the time-varying emotional states
EmoCtrl-TTS can generate speech that closely mimics the time-varying emotional states found in the audio prompt. In these demo samples, the audio prompt is created by concatenating two audio samples from the RAVDESS dataset. The text prompt is “dogs are sitting by the door dogs are sitting by the door” for all generated speech samples.
Emotion | Audio prompt | Generated audio (Voicebox) | Generated audio (ELaTE) | Generated audio (EmoCtrl-TTS) |
---|---|---|---|---|
Angry → Calm | ||||
Sad → Surprised | ||||
Happy → Disgusted | ||||
Calm → Fearful | ||||
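The prompt construction described above (two clips joined into one audio prompt, with the sentence appearing twice in the text prompt) can be sketched as follows. This is a minimal sketch under assumptions: `concat_prompts`, the 16 kHz sample rate, and the constant-valued placeholder waveforms are hypothetical and stand in for real RAVDESS clips.

```python
import numpy as np


def concat_prompts(first, second, sample_rate, gap_seconds=0.0):
    """Join two mono waveforms end to end, optionally with silence in between."""
    gap = np.zeros(int(round(gap_seconds * sample_rate)), dtype=first.dtype)
    return np.concatenate([first, gap, second])


sample_rate = 16000  # assumed sample rate, not stated in the source

# Placeholder waveforms standing in for two clips with different emotions.
angry = np.full(sample_rate * 2, 0.5, dtype=np.float32)  # 2 s "angry" clip
calm = np.full(sample_rate * 3, 0.1, dtype=np.float32)   # 3 s "calm" clip

audio_prompt = concat_prompts(angry, calm, sample_rate)

# The text prompt repeats the sentence, matching the prompt quoted above.
text_prompt = " ".join(["dogs are sitting by the door"] * 2)
```

With real recordings, both clips would need the same sample rate and channel layout before being joined.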
Generating non-verbal vocalization
EmoCtrl-TTS can generate non-verbal vocalizations, such as laughter and crying, that closely match the audio prompt.
Laughing speech generation
(Audio prompt from the DiariST-AliMeeting dataset; real conversational speech in Chinese)
Text prompt (English) | Audio prompt (Chinese) | Generated audio (Voicebox) | Generated audio (ELaTE) | Generated audio (EmoCtrl-TTS) |
---|---|---|---|---|
Ah, look, right, isn’t it? At a glance, oh, yes, then maybe play for a while. Oh, maybe we’ll be fine. | ||||
You remind me of the kitchen knives sold in the morning market. | ||||
But I think buying these financial products won’t be fooled. | ||||
But don’t you think after seeing that number you feel very panicked and very uncomfortable inside? | ||||
You take a look at your share first. | ||||
Crying speech generation
(Audio prompt from the JVNV dataset; staged speech in Japanese)
Text prompt (English) | Audio prompt (Japanese) | Generated audio (Voicebox) | Generated audio (ELaTE) | Generated audio (EmoCtrl-TTS) |
---|---|---|---|---|
Our team suffered a huge defeat today. I deeply regret holding everyone back. | ||||
Ever since she became depressed, every day has been gloomy and painful. I want to help, but I don’t know what to do. | ||||
Ah, last night, I got into a car accident and the other person passed away. It’s so painful to be alive, I can’t help it. | ||||
I ruined an important friendship. Why did I do such a thing? | ||||
Ugh, my brother drowned in the sea yesterday. I cried all night in grief. | ||||
Emotional speech-to-speech translation
EmoCtrl-TTS can be applied to speech-to-speech translation, transferring not only the voice characteristics but also the precise nuance of the source audio. The source audio clips were sampled from the JVNV dataset, which is a Japanese staged emotional speech corpus.
Emotion | Source audio (Japanese) | Translated audio (English), SeamlessExpressive(*) | Translated audio (English), Voicebox(**) | Translated audio (English), ELaTE(**) | Translated audio (English), EmoCtrl-TTS(**) |
---|---|---|---|---|---|
Happy | |||||
Sad | |||||
Angry | |||||
Surprised | |||||
Disgusted | |||||
Fearful | |||||
(*) We used Seamless Expressive purely for research purposes, under the Seamless Licensing Agreement. Copyright © Meta Platforms, Inc. All Rights Reserved.
(**) We used Whisper to transcribe the speech, and then applied GPT-4 to translate the transcription to English.
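The cascade in footnote (**) can be sketched as three stages chained together. This is only an outline under assumptions: `translate_speech` and the three callables are hypothetical names, no real Whisper or GPT-4 API is called, and passing the source audio to the synthesizer as its audio prompt is our reading of the setup rather than a documented interface.

```python
from typing import Any, Callable


def translate_speech(
    source_audio: Any,
    transcribe: Callable[[Any], str],
    translate: Callable[[str], str],
    synthesize: Callable[[str, Any], Any],
) -> Any:
    """Run transcription, text translation, then prompted speech synthesis."""
    source_text = transcribe(source_audio)    # e.g. a speech recognizer
    english_text = translate(source_text)     # e.g. an LLM-based translator
    # The source audio doubles as the audio prompt, so the synthesizer can
    # try to carry over the voice and emotion (assumption).
    return synthesize(english_text, source_audio)


# Stub stages so the sketch runs without any external service.
def fake_transcribe(audio):
    return "konnichiwa"


def fake_translate(text):
    return {"konnichiwa": "hello"}[text]


def fake_synthesize(text, audio_prompt):
    return {"text": text, "prompt": audio_prompt}


result = translate_speech("clip.wav", fake_transcribe, fake_translate, fake_synthesize)
```

Injecting the stages as callables keeps the outline independent of any particular recognizer, translator, or TTS system.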
Ethics statement
EmoCtrl-TTS is purely a research project. Currently, we have no plans to incorporate EmoCtrl-TTS into a product or expand access to the public. EmoCtrl-TTS can synthesize speech that maintains speaker identity and could be used for education, entertainment, journalism, self-authored content, accessibility features, interactive voice response systems, translation, chatbots, and so on. While EmoCtrl-TTS can speak in a voice similar to that of the voice talent, the similarity and naturalness depend on the length and quality of the speech prompt, the background noise, and other factors. The model carries potential risks of misuse, such as spoofing voice identification or impersonating a specific speaker. We conducted the experiments under the assumption that the user agrees to be the target speaker in speech synthesis. If the model is generalized to unseen speakers in the real world, it should include a protocol to ensure that the speaker approves the use of their voice, as well as a synthesized speech detection model. If you suspect that EmoCtrl-TTS is being used in a manner that is abusive or illegal or infringes on your rights or the rights of other people, you can report it at the Report Abuse Portal.