{"id":924168,"date":"2023-03-01T14:47:27","date_gmt":"2023-03-01T22:47:27","guid":{"rendered":"https:\/\/www.microsoft.com\/en-us\/research\/"},"modified":"2023-03-22T14:27:51","modified_gmt":"2023-03-22T21:27:51","slug":"multi-view-learning-for-speech-emotion-recognition","status":"publish","type":"msr-research-item","link":"https:\/\/www.microsoft.com\/en-us\/research\/publication\/multi-view-learning-for-speech-emotion-recognition\/","title":{"rendered":"Multi-View Learning for Speech Emotion Recognition"},"content":{"rendered":"<p>Psychological research has postulated that emotions and sentiment are correlated to dimensional scores of valence, arousal, and dominance. However, the literature of Speech Emotion Recognition focuses on independently predicting the three of them for a given speech audio. In this paper, we evaluate and quantify the predictive power of the dimensional scores towards categorical emotions and sentiment for two publicly available speech emotion datasets. We utilize the three emotional views in a joint multi-view training framework. The views comprise the dimensional scores, emotion categories, and sentiment categories. 
We present a comparison for each emotional view or combination of views, utilizing two general-purpose models for speech-related applications: CNN14 and Wav2Vec2. To our knowledge, this is the first time such a joint framework has been explored. We found that a joint multi-view training framework can produce results as strong as or stronger than models trained independently for each view.<\/p>\n<p>&nbsp;<\/p>\n<p><img loading=\"lazy\" decoding=\"async\" class=\"wp-image-924198 alignleft\" src=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2023\/03\/combined_circumplex-3-1024x752.png\" alt=\"adapted circumplex model of affect\" width=\"390\" height=\"287\" srcset=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2023\/03\/combined_circumplex-3-1024x752.png 1024w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2023\/03\/combined_circumplex-3-300x220.png 300w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2023\/03\/combined_circumplex-3-768x564.png 768w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2023\/03\/combined_circumplex-3-1536x1127.png 1536w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2023\/03\/combined_circumplex-3-2048x1503.png 2048w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2023\/03\/combined_circumplex-3-80x60.png 80w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2023\/03\/combined_circumplex-3-240x176.png 240w\" sizes=\"auto, (max-width: 390px) 100vw, 390px\" 
\/><\/p>\n<p>&nbsp;<\/p>\n<p>&nbsp;<\/p>\n<p>The outer circle and its emotion labels show the circumplex model of affect, adapted from Posner et al., 2005. The x-axis represents valence while the y-axis represents arousal. Inside the circle, we plotted the mean valence and arousal values, with their standard deviations, for 8 emotions from the MSP-Podcast dataset.<\/p>\n<p>&nbsp;<\/p>\n<p>&nbsp;<\/p>\n<p>&nbsp;<\/p>\n<p>&nbsp;<\/p>\n<p>&nbsp;<\/p>\n<p><img loading=\"lazy\" decoding=\"async\" class=\" wp-image-924186 alignleft\" src=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2023\/03\/revised_fig-1-300x175.png\" alt=\"diagram of architecture\" width=\"396\" height=\"231\" srcset=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2023\/03\/revised_fig-1-300x175.png 300w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2023\/03\/revised_fig-1-480x280.png 480w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2023\/03\/revised_fig-1-240x140.png 240w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2023\/03\/revised_fig-1.png 721w\" sizes=\"auto, (max-width: 396px) 100vw, 396px\" \/>Multi-view framework for prediction of Dimensional scores, Sentiment Categories, and Emotion Categories. 
The output layer varies per model: three regression outputs for the three Dimensional scores (valence, arousal, and dominance); three classes for sentiment (Pos, Neu, Neg); five classes for Emotions; and the combinations (Dimensional + Sentiment, Dimensional + Emotion, or all 3).<\/p>\n<p>&nbsp;<\/p>\n<p>&nbsp;<\/p>\n<p>&nbsp;<\/p>\n","protected":false},"excerpt":{"rendered":"<p>Psychological research has postulated that emotions and sentiment are correlated to dimensional scores of valence, arousal, and dominance. However, the literature of Speech Emotion Recognition focuses on independently predicting the three of them for a given speech audio. 
In this paper, we evaluate and quantify the predictive power of the dimensional scores towards categorical emotions and [&hellip;]<\/p>\n","protected":false},"featured_media":0,"template":"","meta":{"msr-url-field":"","msr-podcast-episode":"","msrModifiedDate":"","msrModifiedDateEnabled":false,"ep_exclude_from_search":false,"_classifai_error":"","footnotes":""},"msr-content-type":[3],"msr-research-highlight":[],"research-area":[243062],"msr-publication-type":[193716],"msr-product-type":[],"msr-focus-area":[],"msr-platform":[],"msr-download-source":[],"msr-locale":[268875],"msr-post-option":[],"msr-field-of-study":[],"msr-conference":[],"msr-journal":[],"msr-impact-theme":[],"msr-pillar":[],"class_list":["post-924168","msr-research-item","type-msr-research-item","status-publish","hentry","msr-research-area-audio-acoustics","msr-locale-en_us"],"msr_publishername":"IEEE","msr_edition":"","msr_affiliation":"","msr_published_date":"2023-6-1","msr_host":"","msr_duration":"","msr_version":"","msr_speaker":"","msr_other_contributors":"","msr_booktitle":"","msr_pages_string":"","msr_chapter":"","msr_isbn":"","msr_journal":"","msr_volume":"","msr_number":"","msr_editors":"","msr_series":"","msr_issue":"","msr_organization":"","msr_how_published":"","msr_notes":"","msr_highlight_text":"","msr_release_tracker_id":"","msr_original_fields_of_study":"","msr_download_urls":"","msr_external_url":"","msr_secondary_video_url":"","msr_longbiography":"","msr_microsoftintellectualproperty":1,"msr_main_download":"","msr_publicationurl":"","msr_doi":"","msr_publication_uploader":[{"type":"file","viewUrl":"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2023\/03\/ICASSP_2023_MultiView_Speech_Emotion_Tompkins.pdf","id":"929850","title":"icassp_2023_multiview_speech_emotion_tompkins","label_id":"243109","label":0}],"msr_related_uploader":"","msr_attachments":[{"id":929850,"url":"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2023\/03\/ICASSP_2023_MultiVi
ew_Speech_Emotion_Tompkins.pdf"},{"id":924180,"url":"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2023\/03\/ICASSP_2023_Multiview_Learning_Speech_Emotion_Recognition.pdf"}],"msr-author-ordering":[{"type":"text","value":"Daniel Tompkins","user_id":0,"rest_url":false},{"type":"user_nicename","value":"Dimitra Emmanouilidou","user_id":37461,"rest_url":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/microsoft-research\/v1\/researchers?person=Dimitra Emmanouilidou"},{"type":"user_nicename","value":"Soham Deshmukh","user_id":40312,"rest_url":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/microsoft-research\/v1\/researchers?person=Soham Deshmukh"},{"type":"user_nicename","value":"Benjamin Elizalde","user_id":41662,"rest_url":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/microsoft-research\/v1\/researchers?person=Benjamin Elizalde"}],"msr_impact_theme":[],"msr_research_lab":[199565],"msr_event":[],"msr_group":[144923,702211],"msr_project":[559086],"publication":[],"video":[],"download":[],"msr_publication_type":"inproceedings","related_content":{"projects":[{"ID":559086,"post_title":"Audio Analytics","post_name":"audio-analytics","post_type":"msr-project","post_date":"2019-02-08 15:57:54","post_modified":"2023-01-13 13:28:08","post_status":"publish","permalink":"https:\/\/www.microsoft.com\/en-us\/research\/project\/audio-analytics\/","post_excerpt":"Audio analytics is about analyzing and understanding audio signals captured by digital devices, with numerous applications in enterprise, healthcare, productivity, and smart 
cities.","_links":{"self":[{"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-project\/559086"}]}}]},"_links":{"self":[{"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-research-item\/924168","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-research-item"}],"about":[{"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/types\/msr-research-item"}],"version-history":[{"count":3,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-research-item\/924168\/revisions"}],"predecessor-version":[{"id":924201,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-research-item\/924168\/revisions\/924201"}],"wp:attachment":[{"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/media?parent=924168"}],"wp:term":[{"taxonomy":"msr-content-type","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-content-type?post=924168"},{"taxonomy":"msr-research-highlight","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-research-highlight?post=924168"},{"taxonomy":"msr-research-area","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/research-area?post=924168"},{"taxonomy":"msr-publication-type","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-publication-type?post=924168"},{"taxonomy":"msr-product-type","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-product-type?post=924168"},{"taxonomy":"msr-focus-area","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-focus-area?post=924168"},{"taxonomy":"msr-platform","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-platform?post=924168"},{"taxonomy":"msr-download
-source","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-download-source?post=924168"},{"taxonomy":"msr-locale","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-locale?post=924168"},{"taxonomy":"msr-post-option","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-post-option?post=924168"},{"taxonomy":"msr-field-of-study","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-field-of-study?post=924168"},{"taxonomy":"msr-conference","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-conference?post=924168"},{"taxonomy":"msr-journal","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-journal?post=924168"},{"taxonomy":"msr-impact-theme","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-impact-theme?post=924168"},{"taxonomy":"msr-pillar","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-pillar?post=924168"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}