E2 TTS

Embarrassingly Easy Fully Non-Autoregressive Zero-Shot TTS

E2 TTS (Embarrassingly Easy TTS) is a fully non-autoregressive zero-shot text-to-speech (TTS) system capable of generating the voice of any speaker. Despite its extremely simple model architecture and training scheme, E2 TTS achieves human-level naturalness and state-of-the-art speaker similarity and intelligibility.

State-of-the-art zero-shot TTS with simple architecture

E2 TTS consists of only two modules: the flow-matching Transformer and the vocoder. The input is a sequence of characters with filler tokens. It does not include any additional components such as a duration model or a grapheme-to-phoneme converter, nor does it use complex techniques like monotonic alignment search or cross-attention in a specific architecture.
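
As a rough illustration of the "sequence of characters with filler tokens" input, the sketch below splits text into characters and pads the result with a filler token up to a target length. The filler symbol `<F>`, the function name `pad_with_filler`, and the idea that the target length corresponds to the number of output audio frames are assumptions made for illustration; this page does not specify them.

```python
from typing import List

FILLER = "<F>"  # hypothetical filler symbol; the actual token is not given on this page


def pad_with_filler(text: str, num_frames: int) -> List[str]:
    """Split text into characters and append filler tokens up to num_frames."""
    chars = list(text)
    if len(chars) > num_frames:
        raise ValueError("target length is shorter than the character sequence")
    return chars + [FILLER] * (num_frames - len(chars))


tokens = pad_with_filler("Hi there", 12)
print(tokens)
```

Under this assumption, no duration model or grapheme-to-phoneme step appears anywhere in the input preparation: the padded character sequence is the whole text-side input.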

E2 TTS System Overview

E2 TTS is a zero-shot TTS system that can generate a voice of any speaker using a short audio sample (a.k.a. an audio prompt).

Audio prompt

Generated speech

E2 TTS achieves human-level naturalness, along with speaker similarity and intelligibility that are comparable to or better than those of previous state-of-the-art models.

Generated speech by Voicebox (prior work)

Generated speech by E2 TTS (our work)

Superior flexibility and controllability

E2 TTS can generate natural emotional speech

Happy

Angry

Sad

Disgust

E2 TTS can change the speed of speech

0.7x

1.0x

1.3x

E2 TTS lets users explicitly specify the pronunciation of a word

During our dinner, we enjoyed a bottle of sake, which complemented our sushi perfectly.

During our dinner, we enjoyed a bottle of (S AA1 K EH0), which complemented our (S UW1 SH IY0) perfectly.

Audio samples

Below are audio samples demonstrating how E2 TTS performs. The speech samples were taken from the LibriSpeech-PC test-clean set and the RAVDESS dataset, and are provided for the sole purpose of illustrating E2 TTS.

LibriSpeech-PC

All samples in this section are generated using audio prompts and text prompts from the LibriSpeech-PC test-clean set.

Audio prompt, Text prompt, Ground truth, Generated audio (VALL-E, Voicebox, NaturalSpeech3, E2 TTS)
“Isn’t he splendid”! cried Jasper in intense pride, swelling up. “Father knew how to do it”.
My wife, on the spur of the moment, managed to give the gentlemen a very good dinner.
If he, to keep one oath, must lose one joy, by his life’s star foretold.
But, John, there’s no society – just elementary work
Oh, what a record to read; what a picture to gaze upon; how awful the fact!
The real human division is this: the luminous and the shady.
Captain Martin said: ‘I shall give you a pistol to help protect yourself if worse comes to worst!’

RAVDESS

All samples in this section are generated using audio prompts from the RAVDESS dataset. The text prompts were generated using Copilot.

Emotion, Audio prompt, Text prompt, Generated audio (Voicebox, E2 TTS)

Text prompt: So, I was, like, at the, um, grocery store, and, uh, I saw this, like, really yummy-looking, um, cake, y’know? And I, uh, totally wanted to, like, buy it, but, um, I was, like, on a diet, so, uh, I just, like, stared at it for a while, y’know?
Emotions: Neutral, Calm, Happy, Sad, Angry, Fearful, Disgust, Surprised

Text prompt: I was, like, talking to my friend, and she’s all, um, excited about her, uh, trip to Europe, and I’m just, like, so jealous, right?
Emotions: Neutral, Calm, Happy, Sad, Angry, Fearful, Disgust, Surprised

Hard sentences

E2 TTS can generate the hard sentences from ELLA-V. The following samples are generated using audio prompts from the LibriSpeech-PC test-clean set without cherry-picking.

Audio prompt, Text prompt, E2 TTS generated audio
Active artists always appreciate artistic achievements and applaud awesome artworks.
Brave bakers boldly baked big batches of brownies in beautiful bakeries.
Daring dancers dazzled during dynamic dance displays, drawing delighted crowds.
Excited engineers eagerly enjoyed exploring enormous engineering exhibits.
Friendly farmers faithfully fostered fields, favoring fruitful crops.
Gallant gophers gracefully gambled golden gooseberries on grandiose glaciers.
Happy hikers harmoniously hiked through hilly landscapes on heavenly holidays.
Inquisitive individuals ingeniously invented innovative inventions.
Jovial joggers joyfully joined jogging jaunts, justifying joyful jolliness.
Keen kids keenly knitted knotted knots in kindergartens.
F one F two F four F eight H sixteen H thirty two H sixty four.
Clever cats carefully crafted colorful collages creating cheerful compositions.

Changing the speech rate

E2 TTS can change the speech rate by adjusting the total input duration.
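
One way to picture this is the arithmetic sketch below. It assumes the total duration is the audio prompt's duration plus an estimate of the target text's duration at normal speed, divided by the desired rate. The function name `total_frames`, the 100 frames-per-second resolution, and the example durations are hypothetical; this page does not state how the durations for the samples were derived.

```python
def total_frames(prompt_sec: float, target_sec_at_1x: float, rate: float,
                 frames_per_sec: int = 100) -> int:
    """Return the total number of frames to request for a given speech rate.

    prompt_sec:        duration of the audio prompt (kept unchanged)
    target_sec_at_1x:  estimated duration of the new text at normal speed
    rate:              desired speed factor, e.g. 0.7, 1.0, or 1.3
    """
    if rate <= 0:
        raise ValueError("rate must be positive")
    target_sec = target_sec_at_1x / rate  # slower rate -> longer target segment
    return round((prompt_sec + target_sec) * frames_per_sec)


# Hypothetical example: 3 s prompt, text estimated at 6 s when spoken at 1.0x.
for rate in (0.7, 1.0, 1.3):
    print(rate, total_frames(3.0, 6.0, rate))
```

Only the target segment is rescaled in this sketch; the prompt portion keeps its original length because it is existing audio rather than generated speech.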

Audio prompt, Text prompt, Speech rate (0.7x, 1.0x, 1.3x)
He gave way to the others very readily and retreated unperceived by the Squire and Mistress Fitzooth to the rear of the tent.
“How cheerfully he seems to grin, How neatly spread his claws, And welcome little fishes in With gently smiling jaws”!
Yes; then something better, something still grander, will surely follow, or wherefore should they thus ornament me?
And, though I have grown serene And strong since then, I think that God has willed A still renewable fear…
He wore blue silk stockings, blue knee pants with gold buckles, a blue ruffled waist and a jacket of bright blue braided with gold.
Not only this, but on the table I found a small ball of black dough or clay, with specks of something which looks like sawdust in it.

Specifying the pronunciation without model re-training

E2 TTS allows users to specify the pronunciation of words based on their phoneme sequence.

Text prompt, E2 TTS generated audio
I enjoyed a day in Besiktas, Istanbul.
I enjoyed a day in (B EH1 SH IH0 K T AA0 SH), (IH0 S T AA1 N B UH0 L).
During our dinner, we enjoyed a bottle of sake, which complemented our sushi perfectly.
During our dinner, we enjoyed a bottle of (S AA1 K EH0), which complemented our (S UW1 SH IY0) perfectly.
The Qin Dynasty is renowned for beginning the construction of the Great Wall of China.
The (CH IH1 N) Dynasty is renowned for beginning the construction of the Great Wall of China.
At the concert, Raj, the drummer, received huge applause.
At the concert, (R AA1 J), the drummer, received huge applause.
Whether you say ‘(T AH0 M EY1 T OW0)’ or ‘(T AH0 M AA1 T OW0)’, we can all agree that they’re essential for a good salad.
It’s interesting how ‘vase’ can be ‘(V AA1 Z)’, ‘(V EY1 S)’, or ‘(V AE1 S)’ depending on your accent.
No matter if you say ‘(P IH0 K AA1 N)’, ‘(P IY1 K AE0 N)’, or ‘(P AH0 K AE1 N)’, it is my favorite snack.
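
The annotated prompts above mix plain text with parenthesized phoneme sequences. The sketch below shows one possible way to turn such a prompt into a single token sequence: plain text becomes character tokens and each annotated phoneme becomes its own token. The regular expression, the `@` prefix used to mark phoneme tokens, and the function name `tokenize` are assumptions for illustration; how E2 TTS actually encodes these annotations is not described on this page.

```python
import re
from typing import List

# Heuristic: a parenthesized run of space-separated uppercase symbols, each
# optionally followed by a stress digit (0-2), e.g. "(S AA1 K EH0)".
# Note: an uppercase acronym in parentheses, such as "(TTS)", would also match.
PHONEME_SPAN = re.compile(r"\(([A-Z]+[0-2]?(?: [A-Z]+[0-2]?)*)\)")


def tokenize(text: str) -> List[str]:
    """Return character tokens for plain text and '@'-prefixed phoneme tokens."""
    tokens: List[str] = []
    pos = 0
    for match in PHONEME_SPAN.finditer(text):
        tokens.extend(text[pos:match.start()])  # plain characters before the span
        tokens.extend(f"@{p}" for p in match.group(1).split(" "))
        pos = match.end()
    tokens.extend(text[pos:])  # remaining plain characters
    return tokens


tokens = tokenize("a bottle of (S AA1 K EH0), which")
print(tokens)
```

Marking phoneme tokens with a prefix keeps them distinct from ordinary letters, so `@S` (a phoneme) and `S` (a character) do not collide in the vocabulary.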

Ethics statement

E2 TTS is purely a research project. Currently, we have no plans to incorporate E2 TTS into a product or expand access to the public. E2 TTS can synthesize speech that maintains speaker identity and could be used for education, entertainment, journalism, self-authored content, accessibility features, interactive voice response systems, translation, chatbots, and so on. While E2 TTS can speak in a voice like that of the voice talent, the similarity and naturalness depend on the length and quality of the speech prompt, the background noise, and other factors. The model carries potential risks of misuse, such as spoofing voice identification or impersonating a specific speaker. We conducted the experiments under the assumption that the user agrees to be the target speaker in speech synthesis. If the model is generalized to unseen speakers in the real world, it should include a protocol to ensure that the speaker approves the use of their voice, as well as a synthesized speech detection model. If you suspect that E2 TTS is being used in a manner that is abusive or illegal or infringes on your rights or the rights of other people, you can report it at the Report Abuse Portal.