A typical audio processing pipeline involves extracting acoustic features relevant to the task at hand, followed by decision-making stages such as detection, classification, and knowledge fusion. We use approaches ranging from Gaussian mixture model-universal background model (GMM-UBM) and support vector machine (SVM) classifiers to the latest deep neural network architectures.
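As a simple illustration of such a pipeline, the sketch below extracts clip-level MFCC statistics and feeds them to an SVM classifier. The libraries, feature set, and synthetic data are illustrative assumptions for the sketch, not necessarily the tooling used in this project.

```python
# Minimal sketch of a classical audio classification pipeline: acoustic feature
# extraction (MFCC statistics) followed by an SVM decision stage. librosa and
# scikit-learn are illustrative stand-ins, not necessarily this project's tooling.
import numpy as np
import librosa
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

SR = 16000

def clip_features(y, sr=SR, n_mfcc=20):
    """Summarize frame-level MFCCs into one fixed-length clip descriptor."""
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)        # (n_mfcc, frames)
    return np.concatenate([mfcc.mean(axis=1), mfcc.std(axis=1)])  # (2 * n_mfcc,)

def synth_clip(freq, seconds=1.0):
    """Synthetic stand-in for a recorded clip (a noisy tone)."""
    t = np.linspace(0, seconds, int(SR * seconds), endpoint=False)
    return np.sin(2 * np.pi * freq * t) + 0.1 * np.random.randn(t.size)

# Two toy classes: low-pitched vs. high-pitched clips.
clips = [synth_clip(220) for _ in range(20)] + [synth_clip(880) for _ in range(20)]
labels = [0] * 20 + [1] * 20

X = np.stack([clip_features(y) for y in clips])
clf = make_pipeline(StandardScaler(), SVC(kernel="rbf", C=1.0))
clf.fit(X, labels)
print(clf.predict(np.stack([clip_features(synth_clip(f)) for f in (220, 880)])))  # -> [0 1]
```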
Emotion recognition was one of the first areas of audio analytics we explored. We designed a series of neural network architectures and worked with both public (IEMOCAP, eNTERFACE, Berlin) and private (Xbox, Cortana, XiaoIce) datasets. Labeling is typically done through crowdsourcing, which also gives us an understanding of how well humans perform on the task. Speech emotion recognition is a challenging problem: different people express their emotions in different ways, and even human annotators sometimes cannot agree on the exact emotion labels. A multimodal approach, based on both audio and text (the output of a speech recognizer), can significantly improve model performance. Other non-verbal cue extraction tasks, such as age and gender detection from speech, are easier, and here our neural networks perform as well as human labelers.
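As a rough illustration of how such a multimodal model can be structured, the sketch below fuses an utterance-level acoustic feature vector with a text embedding of the recognized transcript in a simple late-fusion classifier. The feature dimensions, layer sizes, and fusion scheme are illustrative assumptions, not the architectures used in this work.

```python
# Simplified late-fusion sketch for multimodal (audio + text) emotion recognition.
# Layer sizes, feature dimensions, and the fusion scheme are illustrative assumptions.
import torch
import torch.nn as nn

class FusionEmotionClassifier(nn.Module):
    def __init__(self, audio_dim=128, text_dim=768, hidden=256, n_emotions=4):
        super().__init__()
        # Audio branch: maps a clip-level acoustic feature vector into a shared space.
        self.audio_net = nn.Sequential(nn.Linear(audio_dim, hidden), nn.ReLU())
        # Text branch: maps an utterance embedding of the ASR transcript.
        self.text_net = nn.Sequential(nn.Linear(text_dim, hidden), nn.ReLU())
        # Fusion head: concatenate both modalities and predict emotion classes.
        self.head = nn.Sequential(
            nn.Linear(2 * hidden, hidden), nn.ReLU(), nn.Dropout(0.3),
            nn.Linear(hidden, n_emotions),
        )

    def forward(self, audio_feat, text_feat):
        fused = torch.cat([self.audio_net(audio_feat), self.text_net(text_feat)], dim=-1)
        return self.head(fused)  # emotion logits

# Example forward pass on random placeholder features for a batch of 8 utterances.
model = FusionEmotionClassifier()
logits = model(torch.randn(8, 128), torch.randn(8, 768))
print(logits.shape)  # torch.Size([8, 4])
```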
Figure: Emotion, gender, and age recognition results.
Audio event detection was an early project in our audio understanding research. Using pre-trained CNN models as feature extractors, we transfer knowledge from other data domains, which can significantly enrich the in-domain feature representation and improve class separability. We combined this knowledge transfer with fully connected neural networks and achieved classification results close to those of human labelers when testing on a noisy dataset with more than ten audio classes.
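The sketch below illustrates this knowledge-transfer recipe in general terms: a CNN pre-trained on another domain (an ImageNet ResNet-18 is used here purely as a stand-in; the project's actual pre-trained models may differ) is frozen and applied to spectrogram inputs, and a small fully connected network is trained on the extracted features for the in-domain audio classes.

```python
# Sketch of knowledge transfer for audio event detection: a CNN pre-trained on another
# domain (here an ImageNet ResNet-18, as an illustrative stand-in) is frozen and used
# as a feature extractor on log-mel spectrogram "images"; a small fully connected
# network is then trained on those features for the in-domain audio classes.
import torch
import torch.nn as nn
from torchvision.models import resnet18

backbone = resnet18(weights="IMAGENET1K_V1")
backbone.fc = nn.Identity()          # expose the 512-d penultimate features
for p in backbone.parameters():
    p.requires_grad = False          # keep the transferred representation fixed
backbone.eval()

classifier = nn.Sequential(          # trainable fully connected head
    nn.Linear(512, 256), nn.ReLU(), nn.Dropout(0.3),
    nn.Linear(256, 10),              # e.g., ten audio event classes
)

# Placeholder batch of log-mel spectrograms replicated to 3 channels (N, 3, H, W).
spectrograms = torch.randn(4, 3, 224, 224)
with torch.no_grad():
    feats = backbone(spectrograms)
logits = classifier(feats)
print(logits.shape)  # torch.Size([4, 10])
```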
Figure: Sound event detection results.

Audio search is the most complex and difficult of the three research directions. We have built a prototype of an audio search engine that uses joint text-audio embeddings for query indexing and personalized content search, and it shows promising results compared with equivalent search approaches that rely on text information alone. We also proposed a first deep hashing technique for efficient audio retrieval, demonstrating that a low-bit code representation combined with very few training samples (about 10 per class) can achieve high mAP@R values.
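To make the joint-embedding idea concrete, the following sketch ranks indexed audio clips against a text query under the assumption that both have already been mapped into a shared embedding space; the encoders themselves, and the index layout, are hypothetical and not part of this illustration.

```python
# Sketch of text-to-audio retrieval with joint embeddings: queries and audio clips are
# assumed to live in a shared embedding space (the encoders are not shown), so search
# reduces to cosine-similarity ranking over the indexed clips.
import numpy as np

def search(query_emb, audio_index, clip_ids, top_k=5):
    """Rank indexed audio clips by cosine similarity to a query embedding."""
    q = query_emb / np.linalg.norm(query_emb)
    a = audio_index / np.linalg.norm(audio_index, axis=1, keepdims=True)
    scores = a @ q
    order = np.argsort(-scores)[:top_k]
    return [(clip_ids[i], float(scores[i])) for i in order]

# Placeholder index of 1000 clips with 256-d embeddings and a random "query" embedding.
audio_index = np.random.randn(1000, 256)
results = search(np.random.randn(256), audio_index, clip_ids=list(range(1000)))
print(results)
```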
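The deep hashing idea can be sketched in a similar spirit: clips are represented by compact binary codes, retrieval ranks candidates by Hamming distance, and quality is measured with mAP@R. In the sketch below the hash is simply the sign of a hypothetical embedding, standing in for the learned low-bit codes of the actual method; only the retrieval and metric logic are illustrated.

```python
# Sketch of low-bit hashing for audio retrieval plus an mAP@R evaluation.
# The "hash" here is the sign of a placeholder embedding; the real technique learns
# the binary codes, but ranking by Hamming distance and mAP@R scoring work the same way.
import numpy as np

def to_codes(embeddings):
    """Binarize embeddings into compact hash codes (one bit per dimension)."""
    return (embeddings > 0).astype(np.uint8)

def map_at_r(codes, labels):
    """mAP@R with Hamming-distance ranking; each query is excluded from its own results."""
    aps = []
    for i in range(len(codes)):
        dists = np.count_nonzero(codes != codes[i], axis=1)
        dists[i] = codes.shape[1] + 1                 # push the query itself to the end
        relevant = labels == labels[i]
        relevant[i] = False
        r = int(relevant.sum())
        if r == 0:
            continue
        order = np.argsort(dists, kind="stable")[:r]  # top-R retrieved items
        hits = relevant[order]
        precision_at_k = np.cumsum(hits) / (np.arange(r) + 1)
        aps.append(float((precision_at_k * hits).sum() / r))
    return float(np.mean(aps))

# Placeholder 16-bit codes for 200 clips spread across 5 classes.
emb = np.random.randn(200, 16)
labels = np.random.randint(0, 5, size=200)
print(map_at_r(to_codes(emb), labels))
```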
Audio analytics is about analyzing and understanding audio signals captured by digital devices, with numerous applications in enterprise, healthcare, productivity, and smart cities.
Interns","alias":""}],"msr_research_lab":[199565],"msr_impact_theme":[],"_links":{"self":[{"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-project\/559086"}],"collection":[{"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-project"}],"about":[{"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/types\/msr-project"}],"version-history":[{"count":18,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-project\/559086\/revisions"}],"predecessor-version":[{"id":665973,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-project\/559086\/revisions\/665973"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/media\/492002"}],"wp:attachment":[{"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/media?parent=559086"}],"wp:term":[{"taxonomy":"msr-research-area","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/research-area?post=559086"},{"taxonomy":"msr-locale","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-locale?post=559086"},{"taxonomy":"msr-impact-theme","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-impact-theme?post=559086"},{"taxonomy":"msr-pillar","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-pillar?post=559086"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}