We introduce TransVIP, a novel model framework that leverages diverse datasets in a cascade fashion yet supports end-to-end inference through joint probability. Furthermore, we propose two separate encoders that preserve the speaker’s voice characteristics and isochrony from the source speech during translation, making the framework well suited to scenarios such as video dubbing.
Audio Clips
Source
CVSS-T (Groundtruth Text/Voice Cloning)
TransVIP
Groundtruth Text Label / TransVIP Text Out
—————————————————
Before contacting us, check out our frequently asked questions.
Before contacting us, let’s look at the questions we have.
—————————————————
The Court of Auditors told you that.
The Court of Auditors told you that.
—————————————————
The Government understood it.
The Government understood it.
—————————————————
if they come or not.
Whether he’s coming or not.
—————————————————
The Front de gauche member of parliaments don’t understand you.
The parliamentarians of the front-left do not understand you.
—————————————————
It is true, and that’s an absolute positive step forward.
It’s accurate! And it’s a completely positive step.