{"id":924168,"date":"2023-03-01T14:47:27","date_gmt":"2023-03-01T22:47:27","guid":{"rendered":"https:\/\/www.microsoft.com\/en-us\/research\/"},"modified":"2023-03-22T14:27:51","modified_gmt":"2023-03-22T21:27:51","slug":"multi-view-learning-for-speech-emotion-recognition","status":"publish","type":"msr-research-item","link":"https:\/\/www.microsoft.com\/en-us\/research\/publication\/multi-view-learning-for-speech-emotion-recognition\/","title":{"rendered":"Multi-View Learning for Speech Emotion Recognition"},"content":{"rendered":"
Daniel Tompkins, Dimitra Emmanouilidou, Soham Deshmukh, Benjamin Elizalde
Published by IEEE, June 2023 (ICASSP 2023). PDF: https://www.microsoft.com/en-us/research/uploads/prod/2023/03/ICASSP_2023_MultiView_Speech_Emotion_Tompkins.pdf
Related project: Audio Analytics (https://www.microsoft.com/en-us/research/project/audio-analytics/)

Psychological research has postulated that emotions and sentiment are correlated with the dimensional scores of valence, arousal, and dominance. However, the Speech Emotion Recognition literature focuses on predicting the three independently for a given speech audio. In this paper, we evaluate and quantify the predictive power of the dimensional scores toward categorical emotions and sentiment on two publicly available speech emotion datasets. We use the three emotional views in a joint multi-view training framework. The views comprise the dimensional scores, the emotion categories, and the sentiment categories. We present a comparison for each emotional view, and for combinations of views, using two general-purpose models for speech-related applications: CNN14 and Wav2Vec2. To our knowledge, this is the first time such a joint framework has been explored. We found that a joint multi-view training framework can produce results as strong as, or stronger than, models trained independently for each view.

Figure 1: The outer circle and its emotions depict the circumplex model of affect, adapted from Posner et al., 2005. The x-axis represents valence and the y-axis represents arousal. Inside the circle, we plot the mean valence and arousal values, with their standard deviations, for 8 emotions from the MSP-Podcast dataset.

Figure 2: Multi-view framework for the prediction of Dimensional scores, Sentiment categories, and Emotion categories. The output layer varies per model: three regression outputs for the three Dimensional classes (valence, arousal, and dominance); three classes for Sentiment (Pos, Neu, Neg); five classes for Emotions; and their combinations (Dimensional + Sentiment, Dimensional + Emotion, or all three).
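The output configuration described in the Figure 2 caption can be pictured concretely. Below is a minimal PyTorch sketch of a multi-view head on top of a shared speech encoder: one regression head for the three dimensional scores and two classification heads for sentiment (3 classes) and emotion (5 classes), trained with a joint loss. The class names, the 768-dimensional embedding size, the hidden width, and the loss weights are illustrative assumptions, not the paper's architecture or hyperparameters.

```python
# Hypothetical sketch of the multi-view output heads described in Figure 2.
# The input embedding stands in for a CNN14 or Wav2Vec2 encoder output; the
# embed_dim, hidden width, and loss weights are assumptions for illustration.
import torch
import torch.nn as nn

class MultiViewHead(nn.Module):
    def __init__(self, embed_dim: int = 768, hidden: int = 256):
        super().__init__()
        self.trunk = nn.Sequential(nn.Linear(embed_dim, hidden), nn.ReLU())
        self.dimensional = nn.Linear(hidden, 3)  # valence, arousal, dominance (regression)
        self.sentiment = nn.Linear(hidden, 3)    # Pos, Neu, Neg (classification logits)
        self.emotion = nn.Linear(hidden, 5)      # 5 emotion categories (classification logits)

    def forward(self, embedding: torch.Tensor):
        h = self.trunk(embedding)
        return {
            "dimensional": self.dimensional(h),
            "sentiment": self.sentiment(h),
            "emotion": self.emotion(h),
        }

def joint_loss(outputs, targets, weights=(1.0, 1.0, 1.0)):
    """Combine the three views into a single training objective (illustrative weighting)."""
    mse = nn.functional.mse_loss(outputs["dimensional"], targets["vad"])
    sent = nn.functional.cross_entropy(outputs["sentiment"], targets["sentiment"])
    emo = nn.functional.cross_entropy(outputs["emotion"], targets["emotion"])
    return weights[0] * mse + weights[1] * sent + weights[2] * emo

if __name__ == "__main__":
    # Fake batch of 4 utterance embeddings standing in for encoder outputs.
    model = MultiViewHead()
    emb = torch.randn(4, 768)
    targets = {
        "vad": torch.rand(4, 3),                 # valence/arousal/dominance scores
        "sentiment": torch.randint(0, 3, (4,)),  # Pos/Neu/Neg labels
        "emotion": torch.randint(0, 5, (4,)),    # emotion category labels
    }
    out = model(emb)
    loss = joint_loss(out, targets)
    loss.backward()
    print({k: tuple(v.shape) for k, v in out.items()}, float(loss))
```

Training only a subset of heads (e.g., Dimensional + Sentiment) corresponds to dropping the matching terms from the joint loss, which mirrors the view combinations listed in the Figure 2 caption.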
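The abstract's claim that dimensional scores carry predictive power toward categorical emotions and sentiment can be probed with a simple classifier trained on (valence, arousal, dominance) triples alone. The sketch below uses scikit-learn logistic regression on synthetic placeholder data; the labeling rule and score ranges are assumptions, and the paper's actual evaluation on the two public datasets is not reproduced here.

```python
# Hypothetical probe: how well do valence/arousal/dominance scores alone
# predict a categorical label? Synthetic data stands in for a real corpus
# such as MSP-Podcast; the accuracy printed here is meaningless by itself.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

rng = np.random.default_rng(0)
n = 1000
vad = rng.uniform(1.0, 7.0, size=(n, 3))             # valence, arousal, dominance scores
# Placeholder rule: derive a 3-way sentiment label (Neg/Neu/Pos) from valence only.
sentiment = np.digitize(vad[:, 0], bins=[3.0, 5.0])

X_tr, X_te, y_tr, y_te = train_test_split(vad, sentiment, test_size=0.2, random_state=0)
probe = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
print("sentiment-from-VAD accuracy:", accuracy_score(y_te, probe.predict(X_te)))
```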