Towards Code-Switching ASR for End-to-End CTC Models
Although great progress has been made on end-to-end (E2E) models for monolingual and multilingual automatic speech recognition (ASR), to the best of our knowledge there has been no successful study of E2E models on the challenging intra-sentential code-switching (CS) ASR task. In this paper, we propose an approach to CS ASR using E2E connectionist temporal classification (CTC) models. We use a frame-level language identification (LID) model to linearly adjust the posteriors of an E2E CTC model. We evaluate the proposed method on Microsoft live Chinese Cortana data, with 7000 hours of Chinese and English monolingual data and 300 hours of CS data as the training data. Trained with only monolingual data, without observing any CS data, the proposed method obtains up to a 6.3% relative word error rate (WER) reduction. When trained with both monolingual and CS data, the proposed method achieves up to a 4.2% relative WER reduction. The approach also maintains performance on a Chinese test set comparable to that of the baseline models.
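The core operation described above, reweighting frame-level CTC posteriors with LID probabilities, can be sketched as follows. This is a minimal illustration, not the paper's implementation: the function name, the token-index arguments, and the choice to leave the blank symbol unscaled are all assumptions.

```python
import numpy as np

def lid_adjust_posteriors(ctc_post, lid_p_zh, zh_idx, en_idx):
    """Hypothetical sketch of LID-based linear posterior adjustment.

    ctc_post : (T, V) per-frame CTC posteriors over the joint vocabulary
    lid_p_zh : (T,)   per-frame LID probability that the frame is Chinese
    zh_idx   : indices of Chinese tokens in the vocabulary
    en_idx   : indices of English tokens in the vocabulary
    Any remaining index (e.g. the CTC blank) is left unscaled (assumption).
    """
    out = ctc_post.copy()
    # Scale each language group's posteriors by its frame-level LID weight.
    out[:, zh_idx] *= lid_p_zh[:, None]
    out[:, en_idx] *= (1.0 - lid_p_zh)[:, None]
    # Renormalize each frame so posteriors again sum to one.
    out /= out.sum(axis=1, keepdims=True)
    return out
```

For example, on a frame where the LID model is confident the speech is Chinese, English token posteriors are suppressed before CTC decoding, which is what steers the recognizer at intra-sentential switch points.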