
Speech to Speech Translation System with Voice and Isochrony Preservation

We introduce TransVIP, a novel model framework that leverages diverse datasets in a cascade fashion yet supports end-to-end inference through joint probability. We further propose two separate encoders that preserve the speaker’s voice characteristics and isochrony from the source speech during translation, making the framework well suited to scenarios such as video dubbing.

Overview of our speech-to-speech translation framework, which consists of 1) a joint encoder-decoder model that translates speech into target text and coarse-grained speech tokens; 2) a non-autoregressive acoustic model for acoustic details; and 3) a codec model that converts discrete speech tokens back to a waveform.
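
The three stages named in the overview can be pictured as a simple chain of functions. The sketch below is only an illustration of that data flow, not the TransVIP implementation: the class `S2STPipeline`, its field names, and the toy stand-in functions are all hypothetical, and the real models take additional inputs (for example voice and isochrony information) that are omitted here.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

# Hypothetical type aliases for illustration only.
Waveform = List[float]
Tokens = List[int]


@dataclass
class S2STPipeline:
    """Hypothetical wrapper chaining the three stages from the overview."""
    # Stage 1: source speech -> (target text, coarse-grained speech tokens)
    joint_model: Callable[[Waveform], Tuple[str, Tokens]]
    # Stage 2: coarse tokens -> stack of token layers carrying acoustic detail
    acoustic_model: Callable[[Tokens], List[Tokens]]
    # Stage 3: discrete token layers -> waveform
    codec_decoder: Callable[[List[Tokens]], Waveform]

    def translate(self, source: Waveform) -> Tuple[str, Waveform]:
        text, coarse = self.joint_model(source)
        layers = self.acoustic_model(coarse)
        return text, self.codec_decoder(layers)


# Toy stand-ins (assumptions) so the sketch runs without any trained model.
def toy_joint(wave: Waveform) -> Tuple[str, Tokens]:
    return "hello", [len(wave) % 7, 3]


def toy_acoustic(coarse: Tokens) -> List[Tokens]:
    return [coarse, [t + 1 for t in coarse]]


def toy_codec(layers: List[Tokens]) -> Waveform:
    # Sum the layers frame by frame to produce one sample per frame.
    return [float(sum(frame)) for frame in zip(*layers)]


pipeline = S2STPipeline(toy_joint, toy_acoustic, toy_codec)
text, audio = pipeline.translate([0.0] * 10)
```

Keeping each stage behind a plain callable makes it clear that the text output and the speech output come from the same first-stage pass, while the later stages only refine and render the speech tokens.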
Audio Clips


CVSS-T (Groundtruth Text/Voice Cloning)


Groundtruth Text Label / TransVIP Text Out


Before contacting us, check out our frequently asked questions.

Before contacting us, let’s look at the questions we have.


The Court of Auditors told you that.

The Court of Auditors told you that.


The Government understood it.

The Government understood it.


if they come or not.

Whether he’s coming or not.


The Front de gauche member of parliaments don’t understand you.

The parliamentarians of the front-left do not understand you.


It is true, and that’s an absolute positive step forward.

It’s accurate! And it’s a completely positive step.