{"id":630420,"date":"2020-01-09T15:52:36","date_gmt":"2020-01-09T23:52:36","guid":{"rendered":"https:\/\/www.microsoft.com\/en-us\/research\/?post_type=msr-research-item&#038;p=630420"},"modified":"2020-01-09T15:52:36","modified_gmt":"2020-01-09T23:52:36","slug":"cnn-with-phonetic-attention-for-text-independent-speaker-verification","status":"publish","type":"msr-research-item","link":"https:\/\/www.microsoft.com\/en-us\/research\/publication\/cnn-with-phonetic-attention-for-text-independent-speaker-verification\/","title":{"rendered":"CNN with phonetic attention for text-independent speaker verification"},"content":{"rendered":"<p>Text-independent speaker verification imposes no constraints<br \/>\non the spoken content and usually needs long observations<br \/>\nto make reliable prediction. In this paper, we propose two<br \/>\nspeaker embedding approaches by integrating the phonetic information<br \/>\ninto the attention-based residual convolutional neural<br \/>\nnetwork (CNN). Phonetic features are extracted from the<br \/>\nbottleneck layer of a pretrained acoustic model. In implicit<br \/>\nphonetic attention (IPA), the phonetic features are projected<br \/>\nby a transformation network into multi-channel feature maps,<br \/>\nand then concatenated with the raw acoustic features as the input<br \/>\nof the CNN network. In explicit phonetic attention (EPA),<br \/>\nthe phonetic features are directly connected to the attentive<br \/>\npooling layer through a separate 1-dim CNN to generate the<br \/>\nattention weights. With the incorporation of spoken content<br \/>\nand attention mechanism, the system can not only distill the<br \/>\nspeaker-discriminant frames but also actively normalize the<br \/>\nphonetic variations. Multi-head attention and discriminative<br \/>\nobjectives are further studied to improve the system. Experiments<br \/>\non the VoxCeleb corpus show our proposed system<br \/>\ncould outperform the state-of-the-art by around 43% relative.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>Text-independent speaker verification imposes no constraints on the spoken content and usually needs long observations to make reliable prediction. In this paper, we propose two speaker embedding approaches by integrating the phonetic information into the attention-based residual convolutional neural network (CNN). Phonetic features are extracted from the bottleneck layer of a pretrained acoustic model. In [&hellip;]<\/p>\n","protected":false},"featured_media":0,"template":"","meta":{"msr-url-field":"","msr-podcast-episode":"","msrModifiedDate":"","msrModifiedDateEnabled":false,"ep_exclude_from_search":false,"_classifai_error":"","msr-author-ordering":null,"msr_publishername":"","msr_publisher_other":"","msr_booktitle":"","msr_chapter":"","msr_edition":"","msr_editors":"","msr_how_published":"","msr_isbn":"","msr_issue":"","msr_journal":"","msr_number":"","msr_organization":"IEEE","msr_pages_string":"","msr_page_range_start":"","msr_page_range_end":"","msr_series":"","msr_volume":"","msr_copyright":"","msr_conference_name":"Automatic Speech Recognition and Understanding Workshop","msr_doi":"","msr_arxiv_id":"","msr_s2_paper_id":"","msr_mag_id":"","msr_pubmed_id":"","msr_other_authors":"","msr_other_contributors":"","msr_speaker":"","msr_award":"","msr_affiliation":"","msr_institution":"","msr_host":"","msr_version":"","msr_duration":"","msr_original_fields_of_study":"","msr_release_tracker_id":"","msr_s2_match_type":"","msr_citation_count_updated":"","msr_published_date":"2019-12","msr_highlight_text":"","msr_notes":"","msr_longbiography":"","msr_publicationurl":"","msr_external_url":"","msr_secondary_video_url":"","msr_conference_url":"","msr_journal_url":"","msr_s2_pdf_url":"","msr_year":0,"msr_citation_count":0,"msr_influential_citations":0,"msr_reference_count":0,"msr_s2_match_confidence":0,"msr_microsoftintellectualproperty":true,"msr_s2_open_access":false,"msr_s2_author_ids":[],"msr_pub_ids":[],"msr_hide_image_in_river":0,"footnotes":""},"msr-research-highlight":[],"research-area":[13545],"msr-publication-type":[193716],"msr-publisher":[],"msr-focus-area":[],"msr-locale":[268875],"msr-post-option":[],"msr-field-of-study":[],"msr-conference":[],"msr-journal":[],"msr-impact-theme":[],"msr-pillar":[],"class_list":["post-630420","msr-research-item","type-msr-research-item","status-publish","hentry","msr-research-area-human-language-technologies","msr-locale-en_us"],"msr_publishername":"","msr_edition":"","msr_affiliation":"","msr_published_date":"2019-12","msr_host":"","msr_duration":"","msr_version":"","msr_speaker":"","msr_other_contributors":"","msr_booktitle":"","msr_pages_string":"","msr_chapter":"","msr_isbn":"","msr_journal":"","msr_volume":"","msr_number":"","msr_editors":"","msr_series":"","msr_issue":"","msr_organization":"IEEE","msr_how_published":"","msr_notes":"","msr_highlight_text":"","msr_release_tracker_id":"","msr_original_fields_of_study":"","msr_download_urls":"","msr_external_url":"","msr_secondary_video_url":"","msr_longbiography":"","msr_microsoftintellectualproperty":1,"msr_main_download":"","msr_publicationurl":"","msr_doi":"","msr_publication_uploader":[{"type":"file","viewUrl":"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2020\/01\/phonetic_attention_ASRU_2019.pdf","id":"630423","title":"phonetic_attention_asru_2019","label_id":"243109","label":0}],"msr_related_uploader":"","msr_citation_count":0,"msr_citation_count_updated":"","msr_s2_paper_id":"","msr_influential_citations":0,"msr_reference_count":0,"msr_arxiv_id":"","msr_s2_author_ids":[],"msr_s2_open_access":false,"msr_s2_pdf_url":null,"msr_attachments":[{"id":630423,"url":"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2020\/01\/phonetic_attention_ASRU_2019.pdf"}],"msr-author-ordering":[{"type":"text","value":"Tianyan Zhou","user_id":0,"rest_url":false},{"type":"text","value":"yong zhao","user_id":0,"rest_url":false},{"type":"user_nicename","value":"Jinyu Li","user_id":32312,"rest_url":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/microsoft-research\/v1\/researchers?person=Jinyu Li"},{"type":"user_nicename","value":"Yifan Gong","user_id":34994,"rest_url":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/microsoft-research\/v1\/researchers?person=Yifan Gong"},{"type":"text","value":"Jian Wu","user_id":0,"rest_url":false}],"msr_impact_theme":[],"msr_research_lab":[],"msr_event":[],"msr_group":[],"msr_project":[],"publication":[],"video":[],"msr-tool":[],"msr_publication_type":"inproceedings","related_content":[],"_links":{"self":[{"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-research-item\/630420","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-research-item"}],"about":[{"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/types\/msr-research-item"}],"version-history":[{"count":1,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-research-item\/630420\/revisions"}],"predecessor-version":[{"id":630429,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-research-item\/630420\/revisions\/630429"}],"wp:attachment":[{"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/media?parent=630420"}],"wp:term":[{"taxonomy":"msr-research-highlight","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-research-highlight?post=630420"},{"taxonomy":"msr-research-area","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/research-area?post=630420"},{"taxonomy":"msr-publication-type","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-publication-type?post=630420"},{"taxonomy":"msr-publisher","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-publisher?post=630420"},{"taxonomy":"msr-focus-area","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-focus-area?post=630420"},{"taxonomy":"msr-locale","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-locale?post=630420"},{"taxonomy":"msr-post-option","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-post-option?post=630420"},{"taxonomy":"msr-field-of-study","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-field-of-study?post=630420"},{"taxonomy":"msr-conference","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-conference?post=630420"},{"taxonomy":"msr-journal","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-journal?post=630420"},{"taxonomy":"msr-impact-theme","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-impact-theme?post=630420"},{"taxonomy":"msr-pillar","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-pillar?post=630420"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}