Joint Time-Frequency and Time Domain Learning for Speech Enhancement.

2020 International Joint Conference on Artificial Intelligence

Published by International Joint Conferences on Artificial Intelligence Organization


For single-channel speech enhancement, both time-domain and time-frequency-domain methods have their respective pros and cons. In this paper, we present a cross-domain framework named TFT-Net, which takes a time-frequency spectrogram as input and produces a time-domain waveform as output. Such a framework takes advantage of the knowledge we have about spectrograms and avoids some of the drawbacks that T-F-domain methods have been suffering from. In TFT-Net, we design an innovative dual-path attention block (DAB) to fully exploit correlations along the time and frequency axes. We further discover that a sample-independent DAB (SDAB) achieves a good tradeoff between enhanced speech quality and complexity. Ablation studies show that both the cross-domain design and the SDAB block bring large performance gains. When logarithmic MSE is used as the training criterion, TFT-Net achieves the highest SDR and SSNR among state-of-the-art methods on two major speech enhancement benchmarks.
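The abstract names a logarithmic MSE training criterion and the SDR metric without defining them. The following is a minimal sketch of one plausible form of each, not the authors' implementation: the function names `log_mse` and `sdr_db`, the `eps` stabilizer, the choice of base-10 logarithm, and the synthetic test signals are all assumptions made for illustration.

```python
import numpy as np


def log_mse(estimate, target, eps=1e-8):
    """Assumed form of logarithmic MSE: log10 of the mean squared waveform error."""
    err = np.mean((estimate - target) ** 2)
    return float(np.log10(err + eps))


def sdr_db(estimate, target, eps=1e-8):
    """Plain signal-to-distortion ratio in dB: target energy over error energy."""
    signal_energy = np.sum(target ** 2)
    error_energy = np.sum((target - estimate) ** 2)
    return float(10.0 * np.log10((signal_energy + eps) / (error_energy + eps)))


# Hypothetical data: a 1-second 440 Hz tone at 16 kHz with two noise levels.
rng = np.random.default_rng(0)
clean = np.sin(2 * np.pi * 440 * np.arange(16000) / 16000)
noisy = clean + 0.1 * rng.standard_normal(clean.shape)
better = clean + 0.01 * rng.standard_normal(clean.shape)

print("log-MSE noisy / better:", log_mse(noisy, clean), log_mse(better, clean))
print("SDR dB  noisy / better:", sdr_db(noisy, clean), sdr_db(better, clean))
```

Under these assumed definitions, and for a fixed clean target, SDR in dB equals ten times the difference between the log10 of the mean target power and the log-MSE, so lowering the log-MSE raises the SDR by the same amount in dB. That may be one reason a time-domain logarithmic loss pairs naturally with SDR as an evaluation metric.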