Unsupervised Speech Enhancement
Speech enhancement systems are built to remove background noise and reverberation from speech signals. They can be applied in video conferencing systems, virtual assistants, hearing aids, mobile and smart home devices, and more. Conventional speech enhancement systems are trained with supervised learning methods, which require pairs of studio-quality clean target speech and synthetic noisy mixtures. Requiring a ground-truth clean speech dataset has disadvantages: such data is hard to scale and not diverse enough, which makes the trained model less robust in real-world scenarios. Moreover, it is expensive to record clean speech and noise in the same domain as the inference data. Additionally, conventional speech enhancement systems can degrade automatic speech recognition (ASR) performance.
Our project goals are to improve the perceptual quality of enhanced speech, to use abundant real noisy recordings instead of relying on expensive studio-quality data, and to mitigate the ASR performance degradation caused by speech enhancement. To meet these goals, we propose a weighted loss function that combines speech recognition embedding and disentanglement-related losses with the MixIT loss, a recently introduced unsupervised loss for speech separation. In addition, we evaluate how the amount of added noise affects the performance of the unsupervised speech enhancement system. We also investigate how to enforce disentanglement between speech and noise to get the best performance.
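To make the structure of such a loss concrete, here is a minimal NumPy sketch of a MixIT-style term combined with two auxiliary terms by a weighted sum. It is an illustration under stated assumptions, not the implementation used in this work: the function names (`neg_snr`, `mixit_loss`, `combined_loss`), the plain negative-SNR signal loss, the brute-force search over assignments, and the weights `w_asr` and `w_dis` are all hypothetical, and the ASR embedding and disentanglement losses are passed in as precomputed scalars because the abstract does not define them.

```python
import itertools
import numpy as np

def neg_snr(ref, est, eps=1e-8):
    """Negative signal-to-noise ratio in dB (lower is better). Assumed signal loss."""
    err = ref - est
    return -10.0 * np.log10((np.sum(ref ** 2) + eps) / (np.sum(err ** 2) + eps))

def mixit_loss(mix1, mix2, sources):
    """MixIT-style loss: the model separates mix1 + mix2 into M estimated sources;
    each source is assigned to exactly one of the two input mixtures, and the
    assignment with the lowest total loss is kept.

    sources: array of shape (M, T). Brute force over all 2**M assignments.
    """
    num_sources = sources.shape[0]
    best_loss, best_assign = np.inf, None
    for assign in itertools.product([0, 1], repeat=num_sources):
        a = np.array(assign)
        est1 = sources[a == 0].sum(axis=0)  # sources remixed into mixture 1
        est2 = sources[a == 1].sum(axis=0)  # sources remixed into mixture 2
        loss = neg_snr(mix1, est1) + neg_snr(mix2, est2)
        if loss < best_loss:
            best_loss, best_assign = loss, assign
    return best_loss, best_assign

def combined_loss(mixit, asr_embedding_loss, disentanglement_loss,
                  w_asr=0.1, w_dis=0.1):
    """Weighted sum of the three terms; the weights are hypothetical placeholders."""
    return mixit + w_asr * asr_embedding_loss + w_dis * disentanglement_loss

# Toy check: mixture 1 contains sources 0 and 1, mixture 2 contains source 2.
rng = np.random.default_rng(0)
s = rng.standard_normal((3, 1600))
loss, assign = mixit_loss(s[0] + s[1], s[2], s)
```

In this toy setup the estimated sources are exact, so the search recovers the assignment that sends the first two sources to the first mixture and the third to the second. In training, the sources would come from the enhancement network, and no clean reference is needed because only the input mixtures appear as targets.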
Our results show that the proposed loss function improves the perceptual quality of speech and reduces the speech recognition error rate on the noisy VoxCeleb dataset. We also find that enforcing disentanglement of speech and noise at the ASR embedding level achieves better results than at the spectrogram level.
- Date:
- Speakers:
- Viet Anh Trinh
- Affiliation:
- City University of New York (CUNY)