Project CLAP aims to advance the state of the art in audio technologies and develop a next-generation framework for understanding acoustic cues.
The human auditory system hears sounds and extracts from them the meanings and decisions we need to interact with our surroundings. We can listen to an acoustic scene, say a domestic household, and identify whether someone is washing dishes, whether people are talking, and which appliances are running, and we can make inferences ranging from something simple, like the number of people speaking in the room, to something more nuanced, like the mood of the environment.
The breakthrough of Deep Learning (DL) for audio in 2012 and the annual DCASE (Detection and Classification of Acoustic Scenes and Events) challenge series since 2013 have led to remarkable progress in Machine Listening and Sound Understanding. For example, the accuracy of DL models on ESC-50 (environmental sound classification) improved dramatically from 65% (before 2016) to 94% (in 2020). However, current techniques still show a large performance gap between curated datasets and everyday applications, and this gap widens further for domains like Acoustic Scene Classification (ASC) and Speech Emotion Recognition (SER).
Motivated by strong demand from everyday applications and by recent progress on foundation models that employ cross-modality learning and large-scale processing, we strive to advance the state of the art in understanding audio content. We aim to accelerate Microsoft’s acoustic intelligence so that it can impact a wide range of products and empower people.
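To make the cross-modality learning mentioned above concrete, here is a minimal sketch of a CLIP-style symmetric contrastive objective that aligns paired audio and text embeddings in a shared space. The function name, embedding dimension, and temperature value are illustrative assumptions for this sketch, not the project's actual models or settings.

```python
# Sketch only: a symmetric contrastive loss over paired audio/text embeddings,
# the general idea behind cross-modal (language-audio) pretraining.
import torch
import torch.nn.functional as F

def contrastive_loss(audio_emb, text_emb, temperature=0.07):
    """Symmetric cross-entropy over cosine similarities of paired embeddings."""
    audio_emb = F.normalize(audio_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    logits = audio_emb @ text_emb.t() / temperature   # (batch, batch) similarity matrix
    targets = torch.arange(len(audio_emb))            # i-th audio clip pairs with i-th caption
    return (F.cross_entropy(logits, targets) +
            F.cross_entropy(logits.t(), targets)) / 2

# Illustrative usage with random tensors standing in for encoder outputs.
batch, dim = 8, 512
loss = contrastive_loss(torch.randn(batch, dim), torch.randn(batch, dim))
```

A model trained with this kind of objective can then score an audio clip against arbitrary text prompts, which is what enables zero-shot sound understanding beyond a fixed label set.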