{"id":959313,"date":"2023-08-13T16:18:27","date_gmt":"2023-08-13T23:18:27","guid":{"rendered":"https:\/\/www.microsoft.com\/en-us\/research\/?post_type=msr-project&p=959313"},"modified":"2023-08-27T11:11:24","modified_gmt":"2023-08-27T18:11:24","slug":"speechx","status":"publish","type":"msr-project","link":"https:\/\/www.microsoft.com\/en-us\/research\/project\/speechx\/","title":{"rendered":"SpeechX"},"content":{"rendered":"
\n\t
\n\t\t
\n\t\t\t\"waveform\"\t\t<\/div>\n\t\t\n\t\t
\n\t\t\t\n\t\t\t
\n\t\t\t\t\n\t\t\t\t
\n\t\t\t\t\t\n\t\t\t\t\t
\n\t\t\t\t\t\t
\n\t\t\t\t\t\t\t\n\t\t\t\t\t\t\t\n\n

SpeechX<\/h1>\n\n\n\n

Neural Codec Language Model as a Versatile Speech Transformer<\/p>\n\n\t\t\t\t\t\t<\/div>\n\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t<\/div>\n\t\t<\/div>\n\t<\/div>\n<\/section>\n\n\n\n\n\n

SpeechX is a versatile speech generation model that leverages audio and text prompts. It handles both clean and noisy speech inputs and performs zero-shot TTS as well as various tasks that transform the input speech. SpeechX combines neural codec language modeling with multi-task learning using task-dependent prompting. This enables unified treatment of various tasks in an extensible manner and provides a consistent way of leveraging text input for speech enhancement and transformation. The current model, trained on 60K hours of speech audio, can perform zero-shot TTS, noise suppression, target speaker extraction, speech removal, and speech editing, in which the spoken content can be altered while preserving the speaker's voice and background sounds.<\/p>\n\n\n\n

\n
Read the paper<\/a><\/div>\n<\/div>\n\n\n\n
<\/div>\n\n\n\n

Model overview<\/h2>\n\n\n\n

SpeechX is a neural codec language model conditioned on audio and text prompts. It incorporates special tokens for task-dependent prompting, which help the model determine the desired output for each task. The current model was trained on 60K hours of data from LibriLight.<\/p>\n\n\n\n

\"SpeechX<\/figure>\n\n\n\n
<\/div>\n\n\n\n

Multiple tasks with one model<\/h2>\n\n\n\n

SpeechX handles a variety of input-output transformations with a single generic language modeling architecture that operates on acoustic and textual tokens.<\/p>\n\n\n\n

\"Multiple<\/figure>\n\n\n\n
<\/div>\n\n\n\n

Applications \/ demo<\/h2>\n\n\n\n

Below, we include audio samples demonstrating how SpeechX performs on various speech-processing tasks. The audio files were normalized in amplitude and resampled to 16 kHz for listening. The speech samples and transcripts were taken from LibriSpeech test-clean. The samples are provided for the sole purpose of illustrating SpeechX.<\/p>\n\n\n\n

<\/div>\n\n\n\nSpeech Generation Tasks<\/title>\n\n\n<div style=\"width: 100%;margin: 0 auto\">\n\n <!-- Zero-shot TTS -->\n <div style=\"margin-bottom: 50px\">\n <h5>Zero-shot TTS (Text To Speech)<\/h5>\n <p>SpeechX synthesizes speech in the style specified by an audio prompt<\/p>\n <div style=\"border-bottom: 2px solid black;margin-bottom: 2px\"><\/div>\n <div style=\"background-color: #E6E6FA;padding: 20px;border-radius: 5px;max-width: 80%;margin: 20px auto\">\n\n <table style=\"width: 100%;border-collapse: collapse;border: 1px solid;border: none\">\n <thead>\n <tr style=\"border-bottom: 2px solid black\">\n <th style=\"text-align: center;padding: 8px;width:25%\">Text<\/th>\n <th style=\"text-align: center;padding: 8px;width:25%\">Prompt (speaker)<\/th>\n <th style=\"text-align: center;padding: 8px;width:25%\">SpeechX output<\/th>\n <th style=\"text-align: center;padding: 8px;width:25%\">Ground truth<\/th>\n <\/tr>\n <\/thead>\n <tbody>\n \n <!--2-->\n <tr>\n <td style=\"text-align: left;padding: 8px\">miss de graf said kenneth noticing the boy’s face\n critically as he stood where the light from the passage fell upon it<\/td>\n <td style=\"text-align: center;padding: 8px\">\n <audio controls style=\"width: 150px\" src=\"https:\/\/www.microsoft.com\/en-us\/research\/uploads\/prod\/2023\/08\/task_tts_sample6_6829-68769-0018_enroll_normalized.wav\"><\/audio>\n <\/td>\n <td style=\"text-align: center;padding: 8px\">\n <audio controls style=\"width: 150px\" src=\"https:\/\/www.microsoft.com\/en-us\/research\/uploads\/prod\/2023\/08\/task_tts_sample6_6829-68769-0018_predicted_normalized.wav\"><\/audio>\n <\/td>\n <td style=\"text-align: center;padding: 8px\">\n <audio controls style=\"width: 150px\" src=\"https:\/\/www.microsoft.com\/en-us\/research\/uploads\/prod\/2023\/08\/task_tts_sample6_6829-68769-0018_clean_ref_normalized.wav\"><\/audio>\n <\/td>\n <\/tr>\n<!-- 1-->\n <tr>\n <td style=\"text-align: left;padding: 8px\">the paris plant like that at the crystal 
palace was a\n temporary exhibit<\/td>\n <td style=\"text-align: center;padding: 8px\">\n <audio controls style=\"width: 150px\" src=\"https:\/\/www.microsoft.com\/en-us\/research\/uploads\/prod\/2023\/08\/task_tts_sample3_2300-131720-0000_enroll_normalized.wav\"><\/audio>\n <\/td>\n <td style=\"text-align: center;padding: 8px\">\n <audio controls style=\"width: 150px\" src=\"https:\/\/www.microsoft.com\/en-us\/research\/uploads\/prod\/2023\/08\/task_tts_sample3_2300-131720-0000_predicted_normalized.wav\"><\/audio>\n <\/td>\n <td style=\"text-align: center;padding: 8px\">\n <audio controls style=\"width: 150px\" src=\"https:\/\/www.microsoft.com\/en-us\/research\/uploads\/prod\/2023\/08\/task_tts_sample3_2300-131720-0000_clean_ref_normalized.wav\"><\/audio>\n <\/td>\n <\/tr>\n <!-- 3-->\n <tr>\n <td style=\"text-align: left;padding: 8px\">that summer’s emigration however being mainly from the\n free states greatly changed the relative strength of the two parties<\/td>\n <td style=\"text-align: center;padding: 8px\">\n <audio controls style=\"width: 150px\" src=\"https:\/\/www.microsoft.com\/en-us\/research\/uploads\/prod\/2023\/08\/task_tts_sample2_7729-102255-0002_enroll_normalized.wav\"><\/audio>\n <\/td>\n <td style=\"text-align: center;padding: 8px\">\n <audio controls style=\"width: 150px\" src=\"https:\/\/www.microsoft.com\/en-us\/research\/uploads\/prod\/2023\/08\/task_tts_sample2_7729-102255-0002_predicted_normalized.wav\"><\/audio>\n <\/td>\n <td style=\"text-align: center;padding: 8px\">\n <audio controls style=\"width: 150px\" src=\"https:\/\/www.microsoft.com\/en-us\/research\/uploads\/prod\/2023\/08\/task_tts_sample2_7729-102255-0002_clean_ref_normalized.wav\"><\/audio>\n <\/td>\n <\/tr>\n <tr>\n <td style=\"text-align: left;padding: 8px\">it is my heart hung in the sky and no clouds ever\n float between the grave flowers and my heart on high<\/td>\n <td style=\"text-align: center;padding: 8px\">\n <audio controls style=\"width: 150px\" 
src=\"https:\/\/www.microsoft.com\/en-us\/research\/uploads\/prod\/2023\/08\/task_tts_sample1_8555-292519-0007_enroll_normalized.wav\"><\/audio>\n <\/td>\n <td style=\"text-align: center;padding: 8px\">\n <audio controls style=\"width: 150px\" src=\"https:\/\/www.microsoft.com\/en-us\/research\/uploads\/prod\/2023\/08\/task_tts_sample1_8555-292519-0007_predicted_normalized.wav\"><\/audio>\n <\/td>\n <td style=\"text-align: center;padding: 8px\">\n <audio controls style=\"width: 150px\" src=\"https:\/\/www.microsoft.com\/en-us\/research\/uploads\/prod\/2023\/08\/task_tts_sample1_8555-292519-0007_clean_ref_normalized.wav\"><\/audio>\n <\/td>\n <\/tr>\n\n\n\n <!-- ... Add more rows as necessary ... -->\n <\/tbody>\n <\/table>\n <\/div>\n <\/div>\n\n <!-- Clean speech editing -->\n <div style=\"margin-bottom: 50px\">\n <h5>Spoken content editing<\/h5>\n <p>SpeechX helps correct misspoken words.<\/p>\n <div style=\"border-bottom: 2px solid black;margin-bottom: 2px\"><\/div>\n <div style=\"background-color: #E6E6FA;padding: 20px;border-radius: 5px;max-width: 80%;margin: 20px auto\">\n\n <table style=\"width: 100%;border-collapse: collapse\">\n <thead>\n <tr style=\"border-bottom: 2px solid black\">\n <th style=\"text-align: center;padding: 8px;width:25%\">Original Text<\/th>\n <th style=\"text-align: center;padding: 8px;width:25%\">Edited Text<\/th>\n <th style=\"text-align: center;padding: 8px;width:25%\">Original speech<\/th>\n <th style=\"text-align: center;padding: 8px;width:25%\">SpeechX output<\/th>\n <\/tr>\n <\/thead>\n <tbody>\n <!--3-->\n <tr>\n <td style=\"text-align: left;padding: 8px\">\n\n cotton is a wonderful thing is it not boys she said rather primly\n\n <\/td>\n <td style=\"text-align: left;padding: 8px\">\n\n cotton is a wonderful thing <b> with its soft and breathable texture <\/b> she said rather primly\n\n <\/td>\n <td style=\"text-align: center;padding: 8px\">\n <audio controls style=\"width: 150px\" 
src=\"https:\/\/www.microsoft.com\/en-us\/research\/uploads\/prod\/2023\/08\/1995-1826-0019-pre-edit_normalized.wav\"><\/audio>\n <\/td>\n <td style=\"text-align: center;padding: 8px\">\n <audio controls style=\"width: 150px\" src=\"https:\/\/www.microsoft.com\/en-us\/research\/uploads\/prod\/2023\/08\/1995-1826-0019-post-edit_normalized.wav\"><\/audio>\n <\/td>\n <\/tr>\n \n <tr>\n <td style=\"text-align: left;padding: 8px\">\n\n gold is the most common metal in the land of oz and is used for many purposes because it is soft and pliable\n\n <\/td>\n <td style=\"text-align: left;padding: 8px\"> <b> in the land of oz gold is the prevalent metal and is utilized for various reasons <\/b> because it is soft and pliable.\n <\/td>\n <td style=\"text-align: center;padding: 8px\">\n <audio controls style=\"width: 150px\" src=\"https:\/\/www.microsoft.com\/en-us\/research\/uploads\/prod\/2023\/08\/1284-1181-0004-pre-edit_normalized.wav\"><\/audio>\n <\/td>\n <td style=\"text-align: center;padding: 8px\">\n <audio controls style=\"width: 150px\" src=\"https:\/\/www.microsoft.com\/en-us\/research\/uploads\/prod\/2023\/08\/1284-1181-0004-post-edit_normalized.wav\"><\/audio>\n <\/td>\n <\/tr>\n<!--1-->\n <tr>\n <td style=\"text-align: left;padding: 8px\">\n\n its jaw is enormous and according to naturalists it is armed with no less than one hundred\n and eighty two teeth\n\n <\/td>\n <td style=\"text-align: left;padding: 8px\">its jaw is enormous and according to naturalists it is\n armed with no <b> more than five hundred and thirty three <\/b> teeth\n <\/td>\n <td style=\"text-align: center;padding: 8px\">\n <audio controls style=\"width: 150px\" src=\"https:\/\/www.microsoft.com\/en-us\/research\/uploads\/prod\/2023\/08\/task_clean_editing_sample4_260-123286-0028-pre-edit_normalized.wav\"><\/audio>\n <\/td>\n <td style=\"text-align: center;padding: 8px\">\n <audio controls style=\"width: 150px\" 
src=\"https:\/\/www.microsoft.com\/en-us\/research\/uploads\/prod\/2023\/08\/task_clean_editing_sample4_260-123286-0028-post-edit_normalized.wav\"><\/audio>\n <\/td>\n <\/tr>\n \n <!--4-->\n <tr>\n <td style=\"text-align: left;padding: 8px\">shame on you citizens cried he i blush for my fellows\n of nottingham<\/td>\n <td style=\"text-align: left;padding: 8px\"><b>citizens you should be ashamed<\/b> cried he i blush\n for my fellows of nottingham<\/td>\n <td style=\"text-align: center;padding: 8px\">\n <audio controls style=\"width: 150px\" src=\"https:\/\/www.microsoft.com\/en-us\/research\/uploads\/prod\/2023\/08\/task_clean_editing_sample1_61-70968-0020-pre-edit_normalized.wav\"><\/audio>\n <\/td>\n <td style=\"text-align: center;padding: 8px\">\n <audio controls style=\"width: 150px\" src=\"https:\/\/www.microsoft.com\/en-us\/research\/uploads\/prod\/2023\/08\/task_clean_editing_sample1_61-70968-0020-post-edit_normalized.wav\"><\/audio>\n <\/td>\n <\/tr>\n <!-- ... Add more rows as necessary ... 
-->\n <\/tbody>\n <\/table>\n <\/div>\n <\/div>\n <!-- Noisy speech editing -->\n <div style=\"margin-bottom: 50px\">\n <h5>Background-preserving spoken content editing<\/h5>\n <p>SpeechX naturally corrects misspoken words while preserving the original ambience<\/p>\n <div style=\"border-bottom: 2px solid black;margin-bottom: 2px\"><\/div>\n <div style=\"background-color: #E6E6FA;padding: 20px;border-radius: 5px;max-width: 80%;margin: 20px auto\">\n\n <table style=\"width: 100%;border-collapse: collapse\">\n <thead>\n <tr style=\"border-bottom: 2px solid black\">\n <th style=\"text-align: center;padding: 8px;width:25%\">Original Text<\/th>\n <th style=\"text-align: center;padding: 8px;width:25%\">Edited Text<\/th>\n <th style=\"text-align: center;padding: 8px;width:25%\">Original speech<\/th>\n <th style=\"text-align: center;padding: 8px;width:25%\">SpeechX output<\/th>\n <\/tr>\n <\/thead>\n <tbody>\n <tr>\n <td style=\"text-align: left;padding: 8px\">we will go out together to the bower there is a way\n down to the court from my window<\/td>\n <td style=\"text-align: left;padding: 8px\">we will go out together to the bower there is <b>a\n pathway down to the courtyard<\/b><\/td>\n <td style=\"text-align: center;padding: 8px\">\n <audio controls style=\"width: 150px\" src=\"https:\/\/www.microsoft.com\/en-us\/research\/uploads\/prod\/2023\/08\/task_noisy_editing_sample1_61-70970-0016-pre-edit_normalized.wav\"><\/audio>\n <\/td>\n <td style=\"text-align: center;padding: 8px\">\n <audio controls style=\"width: 150px\" src=\"https:\/\/www.microsoft.com\/en-us\/research\/uploads\/prod\/2023\/08\/task_noisy_editing_sample1_61-70970-0016-post-edit_normalized.wav\"><\/audio>\n <\/td>\n <\/tr>\n <tr>\n <td style=\"text-align: left;padding: 8px\">she has been dead these twenty years<\/td>\n <td style=\"text-align: left;padding: 8px\">she has been dead <b>thirty nine<\/b> years<\/td>\n <td style=\"text-align: center;padding: 8px\">\n <audio controls style=\"width: 150px\" 
src=\"https:\/\/www.microsoft.com\/en-us\/research\/uploads\/prod\/2023\/08\/task_noisy_editing_sample2_121-127105-0009-pre-edit_normalized.wav\"><\/audio>\n <\/td>\n <td style=\"text-align: center;padding: 8px\">\n <audio controls style=\"width: 150px\" src=\"https:\/\/www.microsoft.com\/en-us\/research\/uploads\/prod\/2023\/08\/task_noisy_editing_sample2_121-127105-0009-post-edit_normalized.wav\"><\/audio>\n <\/td>\n <\/tr>\n <tr>\n <td style=\"text-align: left;padding: 8px\">\n\n its origin was small a germ an insignificant seed hardly to be thought of as likely to\n arouse opposition\n\n\n\n <\/td>\n <td style=\"text-align: left;padding: 8px\">\n\n its origin was small a germ an insignificant <b> seed always expected to provoke strong <\/b>\n opposition\n\n\n\n <\/td>\n <td style=\"text-align: center;padding: 8px\">\n <audio controls style=\"width: 150px\" src=\"https:\/\/www.microsoft.com\/en-us\/research\/uploads\/prod\/2023\/08\/task_noisy_editing_sample3_4077-13751-0001-pre-edit_normalized.wav\"><\/audio>\n <\/td>\n <td style=\"text-align: center;padding: 8px\">\n <audio controls style=\"width: 150px\" src=\"https:\/\/www.microsoft.com\/en-us\/research\/uploads\/prod\/2023\/08\/task_noisy_editing_sample3_4077-13751-0001-post-edit_normalized.wav\"><\/audio>\n <\/td>\n <\/tr>\n <tr>\n <td style=\"text-align: left;padding: 8px\">\n\n he was a fanatic on formality and he only addressed me in the third person to the point\n where it got tiresome\n\n <\/td>\n <td style=\"text-align: left;padding: 8px\">\n\n he was <b>obsessive about protocol and always spoke to <\/b> me in the third person to the\n point where it got tiresome\n\n <\/td>\n <td style=\"text-align: center;padding: 8px\">\n <audio controls style=\"width: 150px\" src=\"https:\/\/www.microsoft.com\/en-us\/research\/uploads\/prod\/2023\/08\/task_noisy_editing_sample4_8463-294828-0013-pre-edit_normalized.wav\"><\/audio>\n <\/td>\n <td style=\"text-align: center;padding: 8px\">\n <audio controls 
style=\"width: 150px\" src=\"https:\/\/www.microsoft.com\/en-us\/research\/uploads\/prod\/2023\/08\/task_noisy_editing_sample4_8463-294828-0013-post-edit_normalized.wav\"><\/audio>\n <\/td>\n <\/tr>\n <!-- ... Add more rows as necessary ... -->\n <\/tbody>\n <\/table>\n <\/div>\n <\/div>\n <!-- Noise suppression -->\n <div style=\"margin-bottom: 50px\">\n <h5>Noise suppression<\/h5>\n <p>SpeechX removes unwanted background sounds that have been mixed into your recordings. Text input is optional,\n but it helps.<\/p>\n <div style=\"border-bottom: 2px solid black;margin-bottom: 2px\"><\/div>\n <div style=\"background-color: #E6E6FA;padding: 20px;border-radius: 5px;max-width: 80%;margin: 20px auto\">\n\n <table style=\"width: 100%;border-collapse: collapse\">\n <thead>\n <tr style=\"border-bottom: 2px solid black\">\n <th style=\"text-align: center;padding: 8px;width:25%\">Text<\/th>\n <th style=\"text-align: center;padding: 8px;width:25%\">Noisy speech<\/th>\n <th style=\"text-align: center;padding: 8px;width:25%\">SpeechX output<\/th>\n <th style=\"text-align: center;padding: 8px;width:25%\">Ground truth<\/th>\n <\/tr>\n <\/thead>\n <tbody>\n <tr>\n <td style=\"text-align: left;padding: 8px\">secure as he thought in the careful administration of\n justice in that city and the character of its well disposed inhabitants the good hidalgo was\n far from thinking that any disaster could befal his family<\/td>\n <td style=\"text-align: center;text-align: center;padding: 8px\">\n <audio controls style=\"width: 150px\" src=\"https:\/\/www.microsoft.com\/en-us\/research\/uploads\/prod\/2023\/08\/task_ns_fg_sample8_5639-40744-0001_noisy_normalized.wav\"><\/audio>\n <\/td>\n <td style=\"text-align: center;text-align: center;padding: 8px\">\n <audio controls style=\"width: 150px\" src=\"https:\/\/www.microsoft.com\/en-us\/research\/uploads\/prod\/2023\/08\/task_ns_fg_sample8_5639-40744-0001_prediction_normalized.wav\"><\/audio>\n <\/td>\n <td style=\"text-align: center;padding: 
8px\">\n <audio controls style=\"width: 150px\" src=\"https:\/\/www.microsoft.com\/en-us\/research\/uploads\/prod\/2023\/08\/task_ns_fg_sample8_5639-40744-0001_clean_ref_normalized.wav\"><\/audio>\n <\/td>\n <\/tr>\n <tr>\n <td style=\"text-align: left;padding: 8px\">one day when the boy was sent by his grandfather with\n a message to a relation he passed along a street in which there was a great concourse of\n horsemen<\/td>\n <td style=\"text-align: center;text-align: center;padding: 8px\">\n <audio controls style=\"width: 150px\" src=\"https:\/\/www.microsoft.com\/en-us\/research\/uploads\/prod\/2023\/08\/task_ns_fg_sample9_5639-40744-0024_noisy_normalized.wav\"><\/audio>\n <\/td>\n <td style=\"text-align: center;padding: 8px\">\n <audio controls style=\"width: 150px\" src=\"https:\/\/www.microsoft.com\/en-us\/research\/uploads\/prod\/2023\/08\/task_ns_fg_sample9_5639-40744-0024_decompressed_normalized.wav\"><\/audio>\n <\/td>\n <td style=\"text-align: center;padding: 8px\">\n <audio controls style=\"width: 150px\" src=\"https:\/\/www.microsoft.com\/en-us\/research\/uploads\/prod\/2023\/08\/task_ns_fg_sample9_5639-40744-0024_clean_ref_normalized.wav\"><\/audio>\n <\/td>\n <\/tr>\n <tr>\n <td style=\"text-align: left;padding: 8px\">the ideas also remain but they have become types in\n nature forms of men animals birds fishes<\/td>\n <td style=\"text-align: center;text-align: center;padding: 8px\">\n <audio controls style=\"width: 150px\" src=\"https:\/\/www.microsoft.com\/en-us\/research\/uploads\/prod\/2023\/08\/task_ns_fg_sample3_2961-960-0014_noisy_normalized.wav\"><\/audio>\n <\/td>\n <td style=\"text-align: center;text-align: center;padding: 8px\">\n <audio controls style=\"width: 150px\" src=\"https:\/\/www.microsoft.com\/en-us\/research\/uploads\/prod\/2023\/08\/task_ns_fg_sample3_2961-960-0014_prediction_normalized.wav\"><\/audio>\n <\/td>\n <td style=\"text-align: center;padding: 8px\">\n <audio controls style=\"width: 150px\" 
src=\"https:\/\/www.microsoft.com\/en-us\/research\/uploads\/prod\/2023\/08\/task_ns_fg_sample3_2961-960-0014_clean_ref_normalized.wav\"><\/audio>\n <\/td>\n <\/tr>\n <tr>\n <td style=\"text-align: left;padding: 8px\">the salient features of this development of domestic\n service have already been indicated<\/td>\n <td style=\"text-align: center;text-align: center;padding: 8px\">\n <audio controls style=\"width: 150px\" src=\"https:\/\/www.microsoft.com\/en-us\/research\/uploads\/prod\/2023\/08\/task_ns_fg_sample6_3570-5696-0004_noisy_normalized.wav\"><\/audio>\n <\/td>\n <td style=\"text-align: center;text-align: center;padding: 8px\">\n <audio controls style=\"width: 150px\" src=\"https:\/\/www.microsoft.com\/en-us\/research\/uploads\/prod\/2023\/08\/task_ns_fg_sample6_3570-5696-0004_prediction_normalized.wav\"><\/audio>\n <\/td>\n <td style=\"text-align: center;padding: 8px\">\n <audio controls style=\"width: 150px\" src=\"https:\/\/www.microsoft.com\/en-us\/research\/uploads\/prod\/2023\/08\/task_ns_fg_sample6_3570-5696-0004_clean_ref_normalized.wav\"><\/audio>\n <\/td>\n <\/tr>\n <\/tbody>\n <\/table>\n <\/div>\n <\/div>\n\n <!-- Target speaker extraction -->\n <div style=\"margin-bottom: 50px\">\n <h5>Target speaker extraction<\/h5>\n <p>SpeechX zeros in on one person in a mixture of voices<\/p>\n <div style=\"border-bottom: 2px solid black;margin-bottom: 2px\"><\/div>\n <div style=\"background-color: #E6E6FA;padding: 20px;border-radius: 5px;max-width: 80%;margin: 20px auto\">\n\n <table style=\"width: 100%;border-collapse: collapse\">\n <thead>\n <tr style=\"border-bottom: 2px solid black\">\n <th style=\"text-align: center;padding: 8px;width:40%\">Text<\/th>\n <th style=\"text-align: center;padding: 8px;width:15%\">Prompt (speaker)<\/th>\n <th style=\"text-align: center;padding: 8px;width:15%\">Mixed speech<\/th>\n <th style=\"text-align: center;padding: 8px;width:15%\">SpeechX output<\/th>\n <th style=\"text-align: center;padding: 8px;width:15%\">Ground 
truth<\/th>\n <\/tr>\n <\/thead>\n <tbody>\n <tr>\n <td style=\"text-align: left;padding: 8px\">i allude to the goddess<\/br><\/td>\n <td style=\"text-align: center;padding: 8px\">\n <audio controls style=\"width: 150px\" src=\"https:\/\/www.microsoft.com\/en-us\/research\/uploads\/prod\/2023\/08\/task_tse_sample3_7127-75947-0005_enroll_normalized.wav\"><\/audio>\n <\/td>\n <td style=\"text-align: center;padding: 8px\">\n <audio controls style=\"width: 150px\" src=\"https:\/\/www.microsoft.com\/en-us\/research\/uploads\/prod\/2023\/08\/task_tse_sample3_7127-75947-0005_mixed_normalized.wav\"><\/audio>\n <\/td>\n <td style=\"text-align: center;padding: 8px\">\n <audio controls style=\"width: 150px\" src=\"https:\/\/www.microsoft.com\/en-us\/research\/uploads\/prod\/2023\/08\/task_tse_sample3_7127-75947-0005_predicted_normalized.wav\"><\/audio>\n <\/td>\n <td style=\"text-align: center;padding: 8px\">\n <audio controls style=\"width: 150px\" src=\"https:\/\/www.microsoft.com\/en-us\/research\/uploads\/prod\/2023\/08\/task_tse_sample3_7127-75947-0005_clean_ref_normalized.wav\"><\/audio>\n <\/td>\n <\/tr>\n <tr>\n <td style=\"text-align: left;padding: 8px\">at that moment the gentleman entered bearing a huge\n object concealed by a piece of green felt<\/td>\n <td style=\"text-align: center;padding: 8px\">\n <audio controls style=\"width: 150px\" src=\"https:\/\/www.microsoft.com\/en-us\/research\/uploads\/prod\/2023\/08\/task_tse_sample2_4992-41806-0012_enroll_normalized.wav\"><\/audio>\n <\/td>\n <td style=\"text-align: center;padding: 8px\">\n <audio controls style=\"width: 150px\" src=\"https:\/\/www.microsoft.com\/en-us\/research\/uploads\/prod\/2023\/08\/task_tse_sample2_4992-41806-0012_mixed_normalized.wav\"><\/audio>\n <\/td>\n <td style=\"text-align: center;padding: 8px\">\n <audio controls style=\"width: 150px\" src=\"https:\/\/www.microsoft.com\/en-us\/research\/uploads\/prod\/2023\/08\/task_tse_sample2_4992-41806-0012_predicted_normalized.wav\"><\/audio>\n 
<\/td>\n <td style=\"text-align: center;padding: 8px\">\n <audio controls style=\"width: 150px\" src=\"https:\/\/www.microsoft.com\/en-us\/research\/uploads\/prod\/2023\/08\/task_tse_sample2_4992-41806-0012_clean_ref_normalized.wav\"><\/audio>\n <\/td>\n <\/tr>\n <tr>\n <td style=\"text-align: left;padding: 8px\">I knew nothing of the doctrine of faith because we\n were taught sophistry instead of certainty and nobody understood spiritual boasting<\/td>\n <td style=\"text-align: center;padding: 8px\">\n <audio controls style=\"width: 150px\" src=\"https:\/\/www.microsoft.com\/en-us\/research\/uploads\/prod\/2023\/08\/task_tse_sample1_2830-3980-0019_enroll_normalized.wav\"><\/audio>\n <\/td>\n <td style=\"text-align: center;padding: 8px\">\n <audio controls style=\"width: 150px\" src=\"https:\/\/www.microsoft.com\/en-us\/research\/uploads\/prod\/2023\/08\/task_tse_sample1_2830-3980-0019_mixed_normalized.wav\"><\/audio>\n <\/td>\n <td style=\"text-align: center;padding: 8px\">\n <audio controls style=\"width: 150px\" src=\"https:\/\/www.microsoft.com\/en-us\/research\/uploads\/prod\/2023\/08\/task_tse_sample1_2830-3980-0019_predicted_normalized.wav\"><\/audio>\n <\/td>\n <td style=\"text-align: center;padding: 8px\">\n <audio controls style=\"width: 150px\" src=\"https:\/\/www.microsoft.com\/en-us\/research\/uploads\/prod\/2023\/08\/task_tse_sample1_2830-3980-0019_clean_ref_normalized.wav\"><\/audio>\n <\/td>\n <\/tr>\n\n <tr>\n <td style=\"text-align: left;padding: 8px\">it is my heart hung in the sky and no clouds ever\n float between the grave flowers and my heart on high<\/td>\n <td style=\"text-align: center;padding: 8px\">\n <audio controls style=\"width: 150px\" src=\"https:\/\/www.microsoft.com\/en-us\/research\/uploads\/prod\/2023\/08\/task_tse_sample4_8555-292519-0007_enroll_normalized.wav\"><\/audio>\n <\/td>\n <td style=\"text-align: center;padding: 8px\">\n <audio controls style=\"width: 150px\" 
src=\"https:\/\/www.microsoft.com\/en-us\/research\/uploads\/prod\/2023\/08\/task_tse_sample4_8555-292519-0007_mixed_normalized.wav\"><\/audio>\n <\/td>\n <td style=\"text-align: center;padding: 8px\">\n <audio controls style=\"width: 150px\" src=\"https:\/\/www.microsoft.com\/en-us\/research\/uploads\/prod\/2023\/08\/task_tse_sample4_8555-292519-0007_predicted_normalized.wav\"><\/audio>\n <\/td>\n <td style=\"text-align: center;padding: 8px\">\n <audio controls style=\"width: 150px\" src=\"https:\/\/www.microsoft.com\/en-us\/research\/uploads\/prod\/2023\/08\/task_tse_sample4_8555-292519-0007_clean_ref_normalized.wav\"><\/audio>\n <\/td>\n <\/tr>\n <!-- ... more rows ... -->\n <\/tbody>\n <\/table>\n <\/div>\n <\/div>\n\n <!-- Speech removal -->\n <div style=\"margin-bottom: 50px\">\n <h5>Speech removal<\/h5>\n <p>SpeechX can naturally erase human voices for audio redaction<\/p>\n <div style=\"border-bottom: 2px solid black;margin-bottom: 2px\"><\/div>\n <div style=\"background-color: #E6E6FA;padding: 20px;border-radius: 5px;max-width: 80%;margin: 20px auto\">\n\n <table style=\"width: 100%;border-collapse: collapse\">\n <thead>\n <tr style=\"border-bottom: 2px solid black\">\n <th style=\"text-align: center;padding: 8px;width:33.33%\">Noisy speech<\/th>\n <th style=\"text-align: center;padding: 8px;width:33.33%\">SpeechX output<\/th>\n <th style=\"text-align: center;padding: 8px;width:33.33%\">Ground truth<\/th>\n <\/tr>\n <\/thead>\n <tbody>\n <tr>\n <td style=\"text-align: center;padding: 8px\">\n <audio controls style=\"width: 150px\" src=\"https:\/\/www.microsoft.com\/en-us\/research\/uploads\/prod\/2023\/08\/task_ns_bg_sample1_61-70968-0049_noisy_normalized_final.wav\"><\/audio>\n <\/td>\n <td style=\"text-align: center;padding: 8px\">\n <audio controls style=\"width: 150px\" src=\"https:\/\/www.microsoft.com\/en-us\/research\/uploads\/prod\/2023\/08\/task_ns_bg_sample1_61-70968-0049_prediction_normalized_final.wav\"><\/audio>\n <\/td>\n <td 
style=\"text-align: center;padding: 8px\">\n <audio controls style=\"width: 150px\" src=\"https:\/\/www.microsoft.com\/en-us\/research\/uploads\/prod\/2023\/08\/task_ns_bg_sample1_61-70968-0049_noise_ref_normalized_final.wav\"><\/audio>\n <\/td>\n <\/tr>\n <tr>\n <td style=\"text-align: center;padding: 8px\">\n <audio controls style=\"width: 150px\" src=\"https:\/\/www.microsoft.com\/en-us\/research\/uploads\/prod\/2023\/08\/task_ns_bg_sample2_61-70968-0005_noisy_normalized.wav\"><\/audio>\n <\/td>\n <td style=\"text-align: center;padding: 8px\">\n <audio controls style=\"width: 150px\" src=\"https:\/\/www.microsoft.com\/en-us\/research\/uploads\/prod\/2023\/08\/task_ns_bg_sample2_61-70968-0005_prediction_normalized.wav\"><\/audio>\n <\/td>\n <td style=\"text-align: center;padding: 8px\">\n <audio controls style=\"width: 150px\" src=\"https:\/\/www.microsoft.com\/en-us\/research\/uploads\/prod\/2023\/08\/task_ns_bg_sample2_61-70968-0005_noise_ref_normalized.wav\"><\/audio>\n <\/td>\n <\/tr>\n <tr>\n <td style=\"text-align: center;padding: 8px\">\n <audio controls style=\"width: 150px\" src=\"https:\/\/www.microsoft.com\/en-us\/research\/uploads\/prod\/2023\/08\/task_ns_bg_sample3_4992-41806-0003_noisy_normalized.wav\"><\/audio>\n <\/td>\n <td style=\"text-align: center;padding: 8px\">\n <audio controls style=\"width: 150px\" src=\"https:\/\/www.microsoft.com\/en-us\/research\/uploads\/prod\/2023\/08\/task_ns_bg_sample3_4992-41806-0003_prediction_normalized.wav\"><\/audio>\n <\/td>\n <td style=\"text-align: center;padding: 8px\">\n <audio controls style=\"width: 150px\" src=\"https:\/\/www.microsoft.com\/en-us\/research\/uploads\/prod\/2023\/08\/task_ns_bg_sample3_4992-41806-0003_noise_ref_normalized.wav\"><\/audio>\n <\/td>\n <\/tr>\n <tr>\n <td style=\"text-align: center;padding: 8px\">\n <audio controls style=\"width: 150px\" 
src=\"https:\/\/www.microsoft.com\/en-us\/research\/uploads\/prod\/2023\/08\/task_ns_bg_sample4_121-127105-0036_noisy_normalized.wav\"><\/audio>\n <\/td>\n <td style=\"text-align: center;padding: 8px\">\n <audio controls style=\"width: 150px\" src=\"https:\/\/www.microsoft.com\/en-us\/research\/uploads\/prod\/2023\/08\/task_ns_bg_sample4_121-127105-0036_prediction_normalized.wav\"><\/audio>\n <\/td>\n <td style=\"text-align: center;padding: 8px\">\n <audio controls style=\"width: 150px\" src=\"https:\/\/www.microsoft.com\/en-us\/research\/uploads\/prod\/2023\/08\/task_ns_bg_sample4_121-127105-0036_noise_ref_normalized.wav\"><\/audio>\n <\/td>\n <\/tr>\n <!-- ... more rows ... -->\n <\/tbody>\n <\/table>\n <\/div>\n <\/div>\n\n\n\n\n<\/div>\n\n\n\n<div style=\"height:30px\" aria-hidden=\"true\" class=\"wp-block-spacer\"><\/div>\n\n\n\n<h2 class=\"wp-block-heading has-text-align-center\" id=\"ethics-statement\">Ethics statement<\/h2>\n\n\n\n<p>SpeechX can synthesize speech that maintains speaker identity and could be used for education, entertainment, journalism, self-authored content, accessibility features, interactive voice response systems, translation, chatbots, and so on. While SpeechX can speak in a voice like that of the voice talent, the similarity and naturalness depend on the length and quality of the speech prompt, the background noise, and other factors. The model carries potential risks of misuse, such as spoofing voice identification or impersonating a specific speaker. We conducted the experiments under the assumption that the user agrees to be the target speaker in speech synthesis. 
If the model is generalized to unseen speakers in the real world, it should include a protocol to ensure that the speaker approves the use of their voice and a synthesized speech detection model.<\/p>\n\n\n\n<div style=\"height:30px\" aria-hidden=\"true\" class=\"wp-block-spacer\"><\/div>\n\n\n","protected":false},"excerpt":{"rendered":"<p>Neural Codec Language Model as a Versatile Speech Transformer SpeechX is a versatile speech generation model leveraging audio and text prompts, which can deal with both clean and noisy speech inputs and perform zero-shot TTS and various tasks involving transforming the input speech. SpeechX combines neural codec language modeling with multi-task learning using task-dependent prompting. […]<\/p>\n","protected":false},"featured_media":961596,"template":"","meta":{"msr-url-field":"","msr-podcast-episode":"","msrModifiedDate":"","msrModifiedDateEnabled":false,"ep_exclude_from_search":false,"footnotes":""},"research-area":[13556,13545],"msr-locale":[268875],"msr-impact-theme":[],"msr-pillar":[],"class_list":["post-959313","msr-project","type-msr-project","status-publish","has-post-thumbnail","hentry","msr-research-area-artificial-intelligence","msr-research-area-human-language-technologies","msr-locale-en_us","msr-archive-status-active"],"msr_project_start":"","related-publications":[],"related-downloads":[],"related-videos":[],"related-groups":[],"related-events":[],"related-opportunities":[],"related-posts":[],"related-articles":[],"tab-content":[],"slides":[],"related-researchers":[{"type":"user_nicename","display_name":"Xiaofei Wang","user_id":38658,"people_section":"Section name 0","alias":"xiaofewa"},{"type":"user_nicename","display_name":"Manthan Thakker","user_id":39627,"people_section":"Section name 0","alias":"mathakke"},{"type":"user_nicename","display_name":"Naoyuki Kanda","user_id":38661,"people_section":"Section name 0","alias":"nakanda"},{"type":"user_nicename","display_name":"Sefik Emre 
Eskimez","user_id":38655,"people_section":"Section name 0","alias":"seeskime"},{"type":"user_nicename","display_name":"Shujie Liu","user_id":33634,"people_section":"Section name 0","alias":"shujliu"},{"type":"user_nicename","display_name":"Jinyu Li","user_id":32312,"people_section":"Section name 0","alias":"jinyli"}],"msr_research_lab":[199565],"msr_impact_theme":[],"_links":{"self":[{"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-project\/959313"}],"collection":[{"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-project"}],"about":[{"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/types\/msr-project"}],"version-history":[{"count":279,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-project\/959313\/revisions"}],"predecessor-version":[{"id":1006929,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-project\/959313\/revisions\/1006929"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/media\/961596"}],"wp:attachment":[{"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/media?parent=959313"}],"wp:term":[{"taxonomy":"msr-research-area","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/research-area?post=959313"},{"taxonomy":"msr-locale","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-locale?post=959313"},{"taxonomy":"msr-impact-theme","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-impact-theme?post=959313"},{"taxonomy":"msr-pillar","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-pillar?post=959313"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}