{"id":1040145,"date":"2024-05-27T14:28:57","date_gmt":"2024-05-27T21:28:57","guid":{"rendered":"https:\/\/www.microsoft.com\/en-us\/research\/?post_type=msr-project&p=1040145"},"modified":"2024-05-29T19:17:44","modified_gmt":"2024-05-30T02:17:44","slug":"transvip","status":"publish","type":"msr-project","link":"https:\/\/www.microsoft.com\/en-us\/research\/project\/transvip\/","title":{"rendered":"TransVIP"},"content":{"rendered":"
\n\t
\n\t\t
\n\t\t\t\t\t<\/div>\n\t\t\n\t\t
\n\t\t\t\n\t\t\t
\n\t\t\t\t\n\t\t\t\t
\n\t\t\t\t\t\n\t\t\t\t\t
\n\t\t\t\t\t\t
\n\t\t\t\t\t\t\t\n\t\t\t\t\t\t\t\n\n

TransVIP<\/h1>\n\n\n\n

Speech to Speech Translation System with Voice and Isochrony Preservation<\/p>\n\n\t\t\t\t\t\t<\/div>\n\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t<\/div>\n\t\t<\/div>\n\t<\/div>\n<\/section>\n\n\n\n\n\n

We introduce TransVIP, a novel model framework that leverages diverse datasets in a cascade fashion yet supports end-to-end inference through joint probability. We further propose two separate encoders that preserve the speaker\u2019s voice characteristics and isochrony from the source speech during translation, making the framework well suited to scenarios such as video dubbing.<\/p>\n\n\n\n

<\/div>\n\n\n\n
\n
Paper\u2019s Link<\/a><\/div>\n<\/div>\n\n\n\n
Overview of our speech-to-speech translation framework, which consists of: 1) a joint encoder-decoder model that translates speech into target text and coarse-grained speech tokens;
2) a non-autoregressive acoustic model that generates the acoustic details; 3) a codec model that converts discrete speech tokens back to a waveform.<\/figcaption><\/figure>\n\n\n\n
<\/div>\n\n\n\n
Audio Clips<\/h5>\n\n\n\n

<\/p>\n\n\n\n

\n
\n
\n
\n

Source<\/p>\n<\/div>\n\n\n\n

\n

CVSS-T (Ground-Truth Text\/Voice Cloning)<\/p>\n<\/div>\n<\/div>\n<\/div>\n\n\n\n

\n
\n
\n

TransVIP<\/p>\n<\/div>\n\n\n\n

\n

Ground-Truth Text Label \/
TransVIP Text Output<\/p>\n<\/div>\n<\/div>\n<\/div>\n<\/div>\n\n\n\n

\n
\n

\u2014\u2014\u2014\u2014\u2014\u2014\u2014\u2014\u2014\u2014\u2014\u2014\u2014\u2014\u2014\u2014\u2014<\/p>\n\n\n\n

\n
\n
\n
\n