Audio analytics involves analyzing and understanding audio signals captured by digital devices, with numerous applications in enterprise, healthcare, productivity, and smart cities. Applications include customer satisfaction analysis from customer support calls, media content analysis and retrieval, medical diagnostic aids and patient monitoring, assistive technologies for people with hearing impairments, and audio analysis for public safety.
We are working on three research directions in this area:
- Extracting non-verbal cues from human speech. This refers to analyzing a human voice to extract information beyond what speech recognition provides, including speaker identification and verification, as well as age, gender, and emotional state.
- Audio understanding. This aims to analyze and extract insights from audio signals such as detecting audio events, recognizing audio backgrounds, and detecting audio anomalies.
- Audio search. This focus area refers to audio data search mechanisms, essential for navigating through large amounts of raw audio data and metadata. Audio search includes description and annotation of audio data, querying and indexing, and ranking and retrieval.
A typical audio processing pipeline involves the extraction of acoustic features relevant to the task at hand, followed by decision-making schemes for detection, classification, and knowledge fusion. We use various approaches, ranging from the Gaussian mixture model-universal background model (GMM-UBM) and support vector machines (SVMs) to the latest deep neural network architectures.
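As a rough illustration of this pipeline, the sketch below extracts clip-level acoustic features and feeds them to an SVM classifier. The choice of MFCC features, the file paths, and the labels are illustrative assumptions, not the exact setup used in our projects.

```python
# Minimal sketch of a feature-extraction + classification pipeline:
# MFCC-based acoustic features followed by an SVM classifier.
import numpy as np
import librosa
from sklearn.svm import SVC
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

def extract_features(path, sr=16000, n_mfcc=20):
    """Load a clip and summarize frame-level MFCCs into a fixed-length vector."""
    y, _ = librosa.load(path, sr=sr)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)   # (n_mfcc, frames)
    # Mean and standard deviation over time give a simple clip-level descriptor.
    return np.concatenate([mfcc.mean(axis=1), mfcc.std(axis=1)])

# Hypothetical training data: a list of (wav_path, label) pairs.
train_items = [("clips/door_slam.wav", "event"), ("clips/background.wav", "none")]
X = np.stack([extract_features(p) for p, _ in train_items])
y = [label for _, label in train_items]

clf = make_pipeline(StandardScaler(), SVC(kernel="rbf"))
clf.fit(X, y)
print(clf.predict([extract_features("clips/unknown.wav")]))
```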
Emotion recognition is one of the first areas of audio analytics that we started to explore. We designed a series of neural network architectures and worked with both public (IEMOCAP, eNTERFACE, Berlin) and private (Xbox, Cortana, XiaoIce) datasets. Labeling is typically done through crowdsourcing, which also gives us an understanding of how well humans perform on the task. Speech emotion recognition is a challenging problem, as different people express their emotions in different ways, and even human annotators sometimes cannot agree on the exact emotion labels. A multimodal approach, based on both audio and text (the output of a speech recognizer), can provide significant improvement in model performance. Other non-verbal cue extraction tasks, such as age and gender detection from speech, are easier, and there our neural networks perform as well as human labelers.
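The sketch below shows one way such a multimodal approach can be structured: a late-fusion network that combines an acoustic branch with a text branch over the recognizer's transcript. The dimensions, layer sizes, and four emotion classes are illustrative assumptions, not the exact architecture we used.

```python
# Hedged sketch of a late-fusion audio + text emotion classifier.
import torch
import torch.nn as nn

class MultimodalEmotionClassifier(nn.Module):
    def __init__(self, audio_dim=40, text_dim=300, hidden=128, n_emotions=4):
        super().__init__()
        self.audio_branch = nn.Sequential(nn.Linear(audio_dim, hidden), nn.ReLU())
        self.text_branch = nn.Sequential(nn.Linear(text_dim, hidden), nn.ReLU())
        # Fuse the two modalities by concatenation, then classify.
        self.classifier = nn.Linear(2 * hidden, n_emotions)

    def forward(self, audio_feats, text_feats):
        fused = torch.cat([self.audio_branch(audio_feats),
                           self.text_branch(text_feats)], dim=-1)
        return self.classifier(fused)

model = MultimodalEmotionClassifier()
audio_feats = torch.randn(8, 40)    # e.g., clip-level acoustic features
text_feats = torch.randn(8, 300)    # e.g., averaged word embeddings of the transcript
logits = model(audio_feats, text_feats)   # (8, 4) emotion scores
```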
Audio event detection was an early project in our audio understanding research. Using pre-trained CNN models as feature extractors, we enable knowledge transfer from other data domains, which can significantly enrich the in-domain feature representation and separability. We combined knowledge transfer with fully connected neural networks and achieved classification results close to those of human labelers when testing on a noisy dataset with more than ten audio classes.
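A rough sketch of this knowledge-transfer setup is shown below: a CNN pre-trained on images (here torchvision's ResNet-18, an assumption, since the backbone is not specified above) serves as a frozen feature extractor over log-mel spectrograms, with a small fully connected classifier trained on top.

```python
# Illustrative knowledge-transfer sketch for audio event detection.
import torch
import torch.nn as nn
from torchvision.models import resnet18

backbone = resnet18(weights="IMAGENET1K_V1")
backbone.fc = nn.Identity()          # keep the 512-d pooled features
for p in backbone.parameters():
    p.requires_grad = False          # freeze: transfer features, don't fine-tune

classifier = nn.Sequential(
    nn.Linear(512, 128), nn.ReLU(), nn.Dropout(0.3),
    nn.Linear(128, 10),              # e.g., ten audio event classes
)

# A log-mel spectrogram tiled to 3 channels so it matches the image backbone.
spectrogram = torch.randn(4, 1, 224, 224).repeat(1, 3, 1, 1)
with torch.no_grad():
    features = backbone(spectrogram)     # (4, 512) transferred features
logits = classifier(features)            # (4, 10) event scores
```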
[Figure: Sound event detection results.]

Audio search is the most complex and difficult of all three research directions. We have built a prototype of an audio search engine that uses joint text-audio embeddings for query indexing and personalized content search, and it shows promising results when compared to equivalent search approaches that rely on text information alone. We also proposed one of the first deep hashing techniques for efficient audio retrieval, demonstrating that a low-bit coding representation combined with very few training samples (~10 samples per class) can achieve high mAP@R values for audio retrieval.
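To make the retrieval idea concrete, the sketch below binarizes clip embeddings into low-bit hash codes and ranks the database by Hamming distance to the query. The embedding source and the 16-bit code length are assumptions for illustration, not the codes used in our prototype.

```python
# Minimal sketch of hashing-based audio retrieval.
import numpy as np

def to_hash_codes(embeddings):
    """Binarize real-valued embeddings into {0,1} codes via the sign."""
    return (embeddings > 0).astype(np.uint8)

def hamming_rank(query_code, database_codes):
    """Return database indices sorted by Hamming distance to the query."""
    distances = np.count_nonzero(database_codes != query_code, axis=1)
    return np.argsort(distances)

rng = np.random.default_rng(0)
database = to_hash_codes(rng.standard_normal((1000, 16)))  # 1000 clips, 16-bit codes
query = to_hash_codes(rng.standard_normal(16))
print(hamming_rank(query, database)[:5])   # indices of the 5 nearest clips
```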