{"id":611697,"date":"2019-09-30T16:20:52","date_gmt":"2019-09-30T23:20:52","guid":{"rendered":"https:\/\/www.microsoft.com\/en-us\/research\/?post_type=msr-research-item&p=611697"},"modified":"2020-06-08T17:13:36","modified_gmt":"2020-06-09T00:13:36","slug":"cure-dataset-ladder-networks-for-audio-event","status":"publish","type":"msr-research-item","link":"https:\/\/www.microsoft.com\/en-us\/research\/publication\/cure-dataset-ladder-networks-for-audio-event\/","title":{"rendered":"CURE Dataset: Ladder Networks for Audio Event Classification"},"content":{"rendered":"

Audio event classification is an important task for several applications such as surveillance and audio, video, and multimedia retrieval. There are approximately 340 million people with hearing loss who cannot perceive events happening around them. This paper establishes the CURE dataset, which contains a curated set of specific audio events most relevant for people with hearing loss. It is formatted as 5-second sound recordings derived from the Freesound project. We propose a ladder-network-based audio event classifier. We adopted state-of-the-art convolutional neural network (CNN) embeddings as audio features for this task. We start with signal and feature normalization that aims to reduce the mismatch between different recording scenarios. Initially, a CNN is trained on weakly labeled Audioset data. Next, the pre-trained model is adopted as a feature extractor for the proposed CURE corpus. We also explore the performance of the extreme learning machine (ELM) and use a support vector machine (SVM) as the baseline classifier. As a second evaluation set we incorporate ESC-50. Results and discussions validate the superiority of the ladder network over the ELM and SVM classifiers in terms of robustness and increased classification accuracy.
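Of the classifiers compared, the ELM is the simplest: it trains only a linear readout on top of a fixed, randomly initialized hidden layer, which makes it cheap to fit on precomputed CNN embeddings. A minimal sketch in Python, where the hidden width, activation, and ridge penalty are illustrative assumptions rather than the paper's settings:

```python
# Minimal extreme learning machine (ELM) sketch for fixed-size embeddings.
# Hidden width, tanh activation, and the ridge penalty are assumptions.
import numpy as np

class ELMClassifier:
    def __init__(self, n_hidden=1024, ridge=1e-3, seed=0):
        self.n_hidden = n_hidden
        self.ridge = ridge
        self.rng = np.random.default_rng(seed)

    def fit(self, X, y):
        n_classes = int(y.max()) + 1
        # Random, untrained hidden layer: only the readout is learned.
        self.W = self.rng.standard_normal((X.shape[1], self.n_hidden))
        self.b = self.rng.standard_normal(self.n_hidden)
        H = np.tanh(X @ self.W + self.b)
        T = np.eye(n_classes)[y]  # one-hot targets
        # Closed-form ridge-regression readout: beta = (H'H + lambda*I)^-1 H'T
        A = H.T @ H + self.ridge * np.eye(self.n_hidden)
        self.beta = np.linalg.solve(A, H.T @ T)
        return self

    def predict(self, X):
        H = np.tanh(X @ self.W + self.b)
        return (H @ self.beta).argmax(axis=1)
```

On embeddings X of shape (n_samples, n_features) with integer labels y, `ELMClassifier().fit(X, y).predict(X_test)` behaves like any scikit-learn-style classifier.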

 <\/p>\n

\"\"

The deep CNN architecture in the vanilla transfer learning pipeline. Audio embeddings were extracted from layer FC1 of the network, trained on out-domain data. Layers CC1…CC5 are double convolutional layers and C6 is a single convolutional layer. FC1 layer includes batch normalization and ReLu activation while FC2 has sigmoid activation. The global pooling layer averages segment-level embeddings to utterance-level. The soft-max layer outputs a node per labeled class.<\/p><\/div>\n
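The caption above fixes only the block layout; a rough PyTorch sketch of such a network follows, where channel widths, kernel sizes, pooling, input shape, and the exact placement of the global pooling are assumptions:

```python
# Sketch of the embedding CNN: CC1..CC5 double conv blocks, C6 single conv,
# FC1 with batch norm + ReLU, FC2 with sigmoid (Audioset-style multi-label
# pretraining; a softmax head per class would replace it downstream).
import torch
import torch.nn as nn

def double_conv(cin, cout):
    # One "CC" block: two conv layers followed by spatial pooling (assumed).
    return nn.Sequential(
        nn.Conv2d(cin, cout, 3, padding=1), nn.ReLU(),
        nn.Conv2d(cout, cout, 3, padding=1), nn.ReLU(),
        nn.MaxPool2d(2),
    )

class EmbeddingCNN(nn.Module):
    def __init__(self, n_classes, emb_dim=1024):
        super().__init__()
        chans = [1, 64, 128, 256, 512, 512]  # assumed channel widths
        self.cc = nn.Sequential(*[double_conv(chans[i], chans[i + 1])
                                  for i in range(5)])  # CC1..CC5
        self.c6 = nn.Conv2d(512, emb_dim, 1)           # C6: single conv
        self.fc1 = nn.Sequential(nn.Linear(emb_dim, emb_dim),
                                 nn.BatchNorm1d(emb_dim), nn.ReLU())
        self.fc2 = nn.Linear(emb_dim, n_classes)

    def forward(self, x):                # x: (batch, 1, mel_bins, frames)
        h = self.c6(self.cc(x))          # (batch, emb_dim, m', t')
        h = h.flatten(2).mean(dim=2)     # global pooling: segment -> utterance
        emb = self.fc1(h)                # FC1 output = transferred embedding
        return torch.sigmoid(self.fc2(emb)), emb
```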

\"\"

Illustration of a two-layer LadderNet, where x and x^ are input and reconstructed embeddings, y is the output label, and y~ the output of the noisy encoder, injected by Gaussian noise N (0, \u03c3^2). Decoder paths are characterized by denoising functions g(.)and denoising costs Cd(.) at each layer.<\/p><\/div>\n
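A simplified sketch of such a two-layer ladder network: the denoising function g(·) is reduced here to a small learned gate over the lateral (noisy) and top-down signals, a simplification of the original parameterization, and σ and the cost weights are assumed hyperparameters:

```python
# Simplified two-layer ladder network: clean and noisy encoder passes,
# a top-down decoder, and per-layer denoising costs C_d.
import torch
import torch.nn as nn
import torch.nn.functional as F

class Ladder2(nn.Module):
    def __init__(self, d_in, d_hid, n_classes, sigma=0.3):
        super().__init__()
        self.sigma = sigma
        self.enc1 = nn.Linear(d_in, d_hid)
        self.enc2 = nn.Linear(d_hid, n_classes)
        self.dec2 = nn.Linear(n_classes, d_hid)  # top-down path
        self.dec1 = nn.Linear(d_hid, d_in)
        # g(.) as a learned map over lateral + top-down (a simplification)
        self.g1 = nn.Linear(2 * d_hid, d_hid)
        self.g0 = nn.Linear(2 * d_in, d_in)

    def encode(self, x, noisy):
        n = lambda z: z + self.sigma * torch.randn_like(z) if noisy else z
        z0 = n(x)
        z1 = n(torch.relu(self.enc1(z0)))
        return z0, z1, self.enc2(z1)

    def forward(self, x):
        z0, z1, _ = self.encode(x, noisy=False)        # clean targets
        z0n, z1n, y_noisy = self.encode(x, noisy=True)
        u1 = self.dec2(y_noisy)                        # top-down signal
        z1_hat = self.g1(torch.cat([z1n, u1], dim=1))  # denoised layer 1
        z0_hat = self.g0(torch.cat([z0n, self.dec1(z1_hat)], dim=1))
        # Denoising costs C_d at each layer (equal weights assumed)
        cd = F.mse_loss(z1_hat, z1.detach()) + F.mse_loss(z0_hat, z0.detach())
        return y_noisy, cd
```

Training minimizes cross-entropy on y_noisy for labeled batches plus a weighted cd for all batches; the cd term is how unlabeled audio contributes to the ladder objective.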

 <\/p>\n

\"\"

Effect of feature-based normalization on weighted classification accuracy on ESC-50 and proposed SEDSET audio data. Notice the robustness of LadderNet and SVM models against feature normalization, while underlying that SVM performance varied highly during parameter optimization. In the left panel, ELM appears poor in learning the representation without proper normalization of the input data, while LadderNet is slightly affected by it potentially by mismatches between train and test data.<\/p><\/div>\n","protected":false},"excerpt":{"rendered":"
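The feature-based normalization compared in this figure can be reproduced, in spirit, by z-scoring each embedding dimension with training-split statistics before fitting the classifier. A sketch for the SVM baseline, where the RBF kernel and C value are illustrative assumptions:

```python
# Toggle feature-based normalization (per-dimension z-scoring with
# training-split statistics) in front of the SVM baseline.
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

def run(X_train, y_train, X_test, y_test, normalize=True):
    steps = [StandardScaler()] if normalize else []
    clf = make_pipeline(*steps, SVC(kernel="rbf", C=1.0))
    clf.fit(X_train, y_train)
    # The figure reports weighted accuracy; plain accuracy is used here
    # for brevity.
    return clf.score(X_test, y_test)
```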

Publisher: IEEE
Published: August 2019
Authors: Harishchandra Dubey, Dimitra Emmanouilidou, Ivan Tashev
PDF: https://www.microsoft.com/en-us/research/uploads/prod/2019/09/PacRim2019_meta_manuscript_Dubey.pdf