Convolutional Neural Network Techniques for Speech Emotion Recognition

International Workshop on Acoustic Signal Enhancement (IWAENC)

Affect recognition plays an important role in human-computer interaction (HCI). Speech is one of the primary forms of expression and is an important modality for emotion recognition. While multiple recognition systems exist, the most common ones identify discrete categories, such as happiness and sadness, from distinct utterances that are a few seconds long. In many cases, the datasets used for training and evaluation are imbalanced across the emotion labels. This leads to large discrepancies between the unweighted accuracy (UA) and the weighted accuracy (WA). Recently, deep neural networks have shown improved performance on the emotion classification task. In particular, convolutional neural networks capture contextual information from speech feature frames. In this paper, we analyze various convolutional architectures for speech emotion recognition. We report performance on different frame-level features. Further, we analyze various pooling techniques, applied on top of the convolutional layers, to obtain an utterance-level representation for the emotion. Our best system achieves a UA+WA of 121.15, compared to 118.10 for the baseline algorithm.
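To illustrate why label imbalance separates the two metrics, the following is a minimal sketch, assuming the convention common in speech emotion recognition: WA is the overall fraction of correctly classified utterances, and UA is the mean of the per-class recalls. The function names and the toy labels are hypothetical and are not taken from the paper's experiments.

```python
from collections import defaultdict


def weighted_accuracy(y_true, y_pred):
    """Overall accuracy: correct utterances divided by all utterances."""
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)


def unweighted_accuracy(y_true, y_pred):
    """Mean of per-class recalls, so every emotion class counts equally."""
    total = defaultdict(int)  # utterances per true class
    hits = defaultdict(int)   # correctly classified utterances per true class
    for t, p in zip(y_true, y_pred):
        total[t] += 1
        if t == p:
            hits[t] += 1
    recalls = [hits[c] / total[c] for c in total]
    return sum(recalls) / len(recalls)


# Toy imbalanced set (assumed for illustration): 8 "neutral", 2 "happy".
y_true = ["neutral"] * 8 + ["happy"] * 2
# A degenerate classifier that always predicts the majority class.
y_pred = ["neutral"] * 10

wa = weighted_accuracy(y_true, y_pred)
ua = unweighted_accuracy(y_true, y_pred)
print(f"WA = {wa:.2f}, UA = {ua:.2f}")
```

In this toy case the majority-class predictor scores well on WA but poorly on UA, because the recall of the minority class is zero. This is the kind of gap that motivates reporting both metrics.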