EmoCtrl-TTS

Controlling Time-Varying Emotional States of Flow-Matching-Based Zero-Shot Text-to-Speech

EmoCtrl-TTS is an emotion-controllable zero-shot TTS model that can generate highly emotional speech, with non-verbal vocalizations such as laughter and crying, for any speaker. EmoCtrl-TTS is purely a research project. Currently, we have no plans to incorporate EmoCtrl-TTS into a product or expand access to the public.

Controlling time-varying emotional states of zero-shot text-to-speech

EmoCtrl-TTS utilizes embeddings that represent emotion and non-verbal vocalizations to condition the flow-matching-based zero-shot TTS. In order to generate high-quality emotional speech, EmoCtrl-TTS is trained with over 27,000 hours of expressive data, curated using pseudo-labeling.
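The conditioning idea above can be sketched as follows. This is a minimal illustrative model, not the actual EmoCtrl-TTS architecture: the feature dimensions, the small Transformer backbone, and the channel-wise concatenation of per-frame emotion embeddings are all assumptions made for the sketch.

```python
import torch
import torch.nn as nn

class EmotionConditionedFlowModel(nn.Module):
    """Toy vector-field estimator for flow matching, conditioned on
    frame-level emotion / non-verbal-vocalization embeddings by
    concatenating them channel-wise with the other inputs.
    All sizes and the backbone are illustrative, not the real model."""
    def __init__(self, mel_dim=80, text_dim=128, emo_dim=32, hidden=256):
        super().__init__()
        # noisy mel frames + aligned text features + emotion embeddings + time
        self.proj_in = nn.Linear(mel_dim + text_dim + emo_dim + 1, hidden)
        layer = nn.TransformerEncoderLayer(hidden, nhead=4, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, num_layers=2)
        self.proj_out = nn.Linear(hidden, mel_dim)

    def forward(self, x_t, text_feat, emo_emb, t):
        # x_t:      (B, T, mel_dim)  noisy speech at flow time t in [0, 1]
        # text_feat:(B, T, text_dim) frame-aligned text features
        # emo_emb:  (B, T, emo_dim)  per-frame emotion / NV embeddings
        # t:        (B,)             flow time, broadcast to every frame
        t_feat = t.view(-1, 1, 1).expand(-1, x_t.size(1), 1)
        h = torch.cat([x_t, text_feat, emo_emb, t_feat], dim=-1)
        return self.proj_out(self.backbone(self.proj_in(h)))

model = EmotionConditionedFlowModel()
B, T = 2, 50
v = model(torch.randn(B, T, 80), torch.randn(B, T, 128),
          torch.randn(B, T, 32), torch.rand(B))
print(v.shape)  # torch.Size([2, 50, 80])
```

Because the emotion embeddings are supplied per frame rather than per utterance, the conditioning signal can change over time within a single generated utterance, which is what enables the time-varying emotional states demonstrated below.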

Overview

EmoCtrl-TTS can generate speech in any speaker's voice, with non-verbal vocalizations like laughter and crying.

Generated speech samples

EmoCtrl-TTS is specifically designed to capture the time-varying emotional states found in the audio prompt.

Audio prompt (Angry → Calm)

Generated speech by Voicebox (prior work)

Generated speech by EmoCtrl-TTS (our work)

Audio samples

Below, we include audio samples demonstrating how EmoCtrl-TTS performs. The speech samples were taken from the JVNV dataset, the DiariST-AliMeeting dataset, and the RAVDESS dataset. The speech samples below are provided for the sole purpose of illustrating EmoCtrl-TTS.

Capturing the time-varying emotional states

EmoCtrl-TTS can generate speech that closely mimics the time-varying emotional states found in the audio prompt. In these demo samples, the audio prompt is created by concatenating two audio samples from the RAVDESS dataset. The text prompt is “dogs are sitting by the door dogs are sitting by the door” for all generated speech samples.

| Emotion | Audio prompt | Voicebox | ELaTE | EmoCtrl-TTS |
| --- | --- | --- | --- | --- |
| Angry → Calm | (audio) | (audio) | (audio) | (audio) |
| Sad → Surprised | (audio) | (audio) | (audio) | (audio) |
| Happy → Disgusted | (audio) | (audio) | (audio) | (audio) |
| Calm → Fearful | (audio) | (audio) | (audio) | (audio) |
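The prompt construction used above (two RAVDESS clips joined into one time-varying prompt) could be sketched as follows. The demo page does not specify how the clips were joined, so the short crossfade and the dummy sample arrays are assumptions for illustration.

```python
import numpy as np

def concat_prompts(wav_a: np.ndarray, wav_b: np.ndarray, sr: int,
                   crossfade_ms: float = 20.0) -> np.ndarray:
    """Join two mono clips into one prompt whose emotion changes midway.
    A short linear crossfade avoids a click at the boundary; the actual
    joining method used in the demo is not specified."""
    n = int(sr * crossfade_ms / 1000)
    n = min(n, len(wav_a), len(wav_b))
    if n == 0:
        return np.concatenate([wav_a, wav_b])
    fade = np.linspace(0.0, 1.0, n)
    overlap = wav_a[-n:] * (1.0 - fade) + wav_b[:n] * fade
    return np.concatenate([wav_a[:-n], overlap, wav_b[n:]])

# Two 1-second dummy clips at 16 kHz standing in for RAVDESS samples.
sr = 16000
angry = np.random.randn(sr).astype(np.float32)
calm = np.random.randn(sr).astype(np.float32)
prompt = concat_prompts(angry, calm, sr)
print(prompt.shape)  # (31680,): 2 * 16000 samples minus the 320 overlapped
```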

Generating non-verbal vocalization

EmoCtrl-TTS can generate non-verbal vocalizations, such as laughter and crying, that closely match the audio prompt.

Laughing speech generation

(Audio prompt from DiariST-AliMeeting dataset; real conversational speech in Chinese)

| Audio prompt (Chinese) | Text prompt (English) | Voicebox | ELaTE | EmoCtrl-TTS |
| --- | --- | --- | --- | --- |
| (audio) | Ah, look, right, isn’t it? At a glance, oh, yes, then maybe play for a while. Oh, maybe we’ll be fine. | (audio) | (audio) | (audio) |
| (audio) | You remind me of the kitchen knives sold in the morning market. | (audio) | (audio) | (audio) |
| (audio) | But I think buying these financial products won’t be fooled. | (audio) | (audio) | (audio) |
| (audio) | But don’t you think after seeing that number you feel very panicked and very uncomfortable inside? | (audio) | (audio) | (audio) |
| (audio) | You take a look at your share first. | (audio) | (audio) | (audio) |
Crying speech generation

(Audio prompt from JVNV dataset; staged speech in Japanese)

| Audio prompt (Japanese) | Text prompt (English) | Voicebox | ELaTE | EmoCtrl-TTS |
| --- | --- | --- | --- | --- |
| (audio) | Our team suffered a huge defeat today. I deeply regret holding everyone back. | (audio) | (audio) | (audio) |
| (audio) | Ever since she became depressed, every day has been gloomy and painful. I want to help, but I don’t know what to do. | (audio) | (audio) | (audio) |
| (audio) | Ah, last night, I got into a car accident and the other person passed away. It’s so painful to be alive, I can’t help it. | (audio) | (audio) | (audio) |
| (audio) | I ruined an important friendship. Why did I do such a thing? | (audio) | (audio) | (audio) |
| (audio) | Ugh, my brother drowned in the sea yesterday. I cried all night in grief. | (audio) | (audio) | (audio) |

Emotional speech-to-speech translation

EmoCtrl-TTS can be applied to speech-to-speech translation, transferring not only the voice characteristics but also the precise nuance of the source audio. The source audios were sampled from the JVNV dataset, a Japanese staged emotional speech corpus.

| Emotion | Source audio (Japanese) | SeamlessExpressive(*) | Voicebox(**) | ELaTE(**) | EmoCtrl-TTS(**) |
| --- | --- | --- | --- | --- | --- |
| Happy | (audio) | (audio) | (audio) | (audio) | (audio) |
| Sad | (audio) | (audio) | (audio) | (audio) | (audio) |
| Angry | (audio) | (audio) | (audio) | (audio) | (audio) |
| Surprised | (audio) | (audio) | (audio) | (audio) | (audio) |
| Disgusted | (audio) | (audio) | (audio) | (audio) | (audio) |
| Fearful | (audio) | (audio) | (audio) | (audio) | (audio) |

(*) We used Seamless Expressive for a pure research purpose. Seamless Expressive was used based on the Seamless Licensing Agreement. Copyright © Meta Platforms, Inc. All Rights Reserved.
(**) We used Whisper to transcribe the speech, and then applied GPT-4 to translate the transcription to English.
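The cascade described in (**) could be sketched as a simple function composition. The transcription and translation steps are injected as callables so the sketch stays self-contained; in practice they would wrap the actual Whisper and GPT-4 calls, and the dummy stand-ins below are purely illustrative.

```python
def translate_speech_prompt(audio_path, transcribe, translate):
    """Cascaded text-prompt pipeline for the (**) rows: an ASR model
    (Whisper in the demo) transcribes the source audio, then an LLM
    (GPT-4 in the demo) translates the transcript to English."""
    transcript = transcribe(audio_path)
    return translate(transcript, target_lang="English")

# Dummy stand-ins so the sketch runs without model access.
demo_asr = lambda path: "昨日、海で弟が溺れてしまった。"
demo_llm = lambda text, target_lang: f"[{target_lang}] translation of: {text}"
print(translate_speech_prompt("sample.wav", demo_asr, demo_llm))
```

The translated English text is then used as the text prompt for the zero-shot TTS systems, while the source audio serves as the audio prompt.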

Ethics statement

EmoCtrl-TTS is purely a research project. Currently, we have no plans to incorporate EmoCtrl-TTS into a product or expand access to the public.

EmoCtrl-TTS could synthesize speech that maintains speaker identity and could be used for education, entertainment, journalism, self-authored content, accessibility features, interactive voice response systems, translation, chatbots, and so on. While EmoCtrl-TTS can speak in a voice similar to that of the voice talent, the similarity and naturalness depend on the length and quality of the speech prompt, the background noise, and other factors. The model carries potential risks of misuse, such as spoofing voice identification or impersonating a specific speaker.

We conducted the experiments under the assumption that the user agrees to be the target speaker in speech synthesis. If the model is generalized to unseen speakers in the real world, it should include a protocol to ensure that the speaker approves the use of their voice, as well as a synthesized speech detection model. If you suspect that EmoCtrl-TTS is being used in a manner that is abusive or illegal or infringes on your rights or the rights of other people, you can report it at the Report Abuse Portal.