October 25, 2020 – October 29, 2020

Microsoft at INTERSPEECH 2020

Location: Virtual

All times are displayed in GMT+8

Sunday, October 25

20:00 – 21:30 | Tutorial B-2-1
Neural Approaches to Conversational Information Retrieval
Jianfeng Gao, Chenyan Xiong, Paul Bennett

20:00 – 21:30 | Tutorial B-3-1
Neural Models for Speaker Diarization in the Context of Speech Recognition
Kyu J. Han, Tae Jin Park, Dimitrios Dimitriadis

21:45 – 23:15 | Tutorial B-2-2
Neural Approaches to Conversational Information Retrieval
Jianfeng Gao, Chenyan Xiong, Paul Bennett

21:45 – 23:15 | Tutorial B-3-2
Neural Models for Speaker Diarization in the Context of Speech Recognition
Kyu J. Han, Tae Jin Park, Dimitrios Dimitriadis

Monday, October 26

19:15 – 20:15 | ASR neural network architectures I
On the Comparison of Popular End-to-End Models for Large Scale Speech Recognition (Microsoft Research Asia)
Jinyu Li, Yu Wu, Yashesh Gaur, Chengyi Wang, Rui Zhao, Shujie Liu

19:15 – 20:15 | ASR neural network architectures I
Joint Speaker Counting, Speech Recognition, and Speaker Identification for Overlapped Speech of Any Number of Speakers
Naoyuki Kanda, Yashesh Gaur, Xiaofei Wang, Zhong Meng, Zhuo Chen, Tianyan Zhou, Takuya Yoshioka

19:15 – 20:15 | Multi-channel speech enhancement
Online directional speech enhancement using geometrically constrained independent vector analysis
Li Li, Kazuhito Koishida, Shoji Makino

19:15 – 20:15 | Multi-channel speech enhancement
An End-to-end Architecture of Online Multi-channel Speech Separation
Jian Wu, Zhuo Chen, Jinyu Li, Takuya Yoshioka, Zhili Tan

19:15 – 20:15 | Speech Signal Representation
Robust pitch regression with voiced/unvoiced classification in nonstationary noise environments
Dung Tran, Uros Batricevic, Kazuhito Koishida

19:15 – 20:15 | Speaker Diarization
Online Speaker Diarization with Relation Network
Xiang Li, Yucheng Zhao, Chong Luo, Wenjun Zeng

19:15 – 20:15 | Speaker Diarization
Speaker attribution with voice profiles by graph-based semi-supervised learning
Jixuan Wang (University of Toronto), Xiong Xiao, Jian Wu, Ranjani Ramamurthy, Frank Rudzicz (University of Toronto), Michael Brudno (University of Toronto)

19:15 – 20:15 | Noise robust and distant speech recognition
Neural Speech Separation Using Spatially Distributed Microphones
Dongmei Wang, Zhuo Chen, Takuya Yoshioka

20:30 – 21:30 | ASR neural network architectures and training I
Fast and Slow Acoustic Model
Kshitiz Kumar, Emilian Stoimenov, Hosam Khalil, Jian Wu

20:30 – 21:30 | Evaluation of Speech Technology Systems and Methods for Resource Construction and Annotation
Neural Zero-Inflated Quality Estimation Model For Automatic Speech Recognition System
Kai Fan, Bo Li, Jiayi Wang, Shiliang Zhang, Boxing Chen, Niyu Ge, Zhi-Jie Yan

20:30 – 21:30 | ASR model training and strategies
Semantic Mask for Transformer based End-to-End Speech Recognition
Chengyi Wang, Yu Wu, Yujiao Du, Jinyu Li, Shujie Liu, Liang Lu, Shuo Ren, Guoli Ye, Sheng Zhao, Ming Zhou

20:30 – 21:30 | ASR model training and strategies
A Federated Approach in Training Acoustic Models
Dimitrios Dimitriadis, Kenichi Kumatani, Robert Gmyr, Yashesh Gaur, Sefik Emre Eskimez

21:45 – 22:45 | Cross/multi-lingual and code-switched speech recognition
A 43 Language Multilingual Punctuation Prediction Neural Network Model
Xinxing Li, Edward Lin

21:45 – 22:45 | Singing Voice Computing and Processing in Music
Transfer Learning for Improving Singing-Voice Detection in Polyphonic Instrumental Music
Yuanbo Hou, Frank Soong, Jian Luan, Shengchen Li

21:45 – 22:45 | Acoustic model adaptation for ASR
Rapid RNN-T Adaptation Using Personalized Speech Synthesis and Neural Language Generator
Yan Huang, Jinyu Li, Lei He, Wenning Wei, William Gale, Yifan Gong

21:45 – 22:45 | Singing and Multimodal Synthesis
Adversarially Trained Multi-Singer Sequence-To-Sequence Singing Synthesizer
Jie Wu, Jian Luan

21:45 – 22:45 | Singing and Multimodal Synthesis
XiaoiceSing: A High-Quality and Integrated Singing Voice Synthesis System
Peiling Lu, Jie Wu, Jian Luan, Xu Tan, Li Zhou

21:45 – 22:45 | Student Events
ISCA-SAC: 2nd Mentoring Event
Mentor: Jinyu Li

Tuesday, October 27

19:15 – 20:15 | Feature extraction and distant ASR
Bandpass Noise Generation and Augmentation for Unified ASR
Kshitiz Kumar, Bo Ren, Yifan Gong, Jian Wu

19:15 – 20:15 | Search for speech recognition
Combination of end-to-end and hybrid models for speech recognition
Jeremy Heng Meng Wong, Yashesh Gaur, Rui Zhao, Liang Lu, Eric Sun, Jinyu Li, Yifan Gong

Wednesday, October 28

19:15 – 20:15 | Streaming ASR
1-D Row-Convolution LSTM: Fast Streaming ASR at Accuracy Parity with LC-BLSTM
Kshitiz Kumar, Chaojun Liu, Yifan Gong, Jian Wu

19:15 – 20:15 | Streaming ASR
Low Latency End-to-End Streaming Speech Recognition with a Scout Network
Chengyi Wang, Yu Wu, Liang Lu, Shujie Liu, Jinyu Li, Guoli Ye, Ming Zhou

19:15 – 20:15 | Streaming ASR
Transfer Learning Approaches for Streaming End-to-End Speech Recognition System
Vikas Joshi, Rui Zhao, Rupesh Mehta, Kshitiz Kumar, Jinyu Li

19:15 – 20:15 | Applications of ASR
SpecMark: A Spectral Watermarking Framework for IP Protection of Speech Recognition Systems
Huili Chen, Bita Darvish Rouhani, Farinaz Koushanfar

19:15 – 20:15 | Single-channel speech enhancement I
Low-Latency Single Channel Speech Dereverberation using U-Net Convolutional Neural Networks
Ahmet E. Bulut, Kazuhito Koishida

19:15 – 20:15 | Single-channel speech enhancement I
Single-channel speech enhancement by subspace affinity minimization
Dung Tran, Kazuhito Koishida

19:15 – 20:15 | Deep Noise Suppression Challenge
The INTERSPEECH 2020 Deep Noise Suppression Challenge: Datasets, Subjective Testing Framework, and Challenge Results
Chandan Karadagur Ananda Reddy, Vishak Gopal, Ross Cutler, Ebrahim Beyrami, Roger Cheng, Harishchandra Dubey, Sergiy Matusevych, Robert Aichner, Ashkan Aazami, Sebastian Braun, Puneet Rana, Sriram Srinivasan, Johannes Gehrke

20:30 – 21:30 | Spoken Term Detection
Re-weighted Interval Loss for Handling Data Imbalance Problem of End-to-End Keyword Spotting
Kun Zhang, Zhiyong Wu, Daode Yuan, Jian Luan, Jia Jia, Helen Meng, Binheng Song

20:30 – 21:30 | Training strategies for ASR
Serialized Output Training for End-to-End Overlapped Speech Recognition
Naoyuki Kanda, Yashesh Gaur, Xiaofei Wang, Zhong Meng, Takuya Yoshioka

20:30 – 21:30 | Speech transmission & coding
An Open Source Implementation of ITU-T Recommendation P.808 with Validation
Babak Naderi, Ross Cutler

20:30 – 21:30 | Speech transmission & coding
DNN No-Reference PSTN Speech Quality Prediction
Gabriel Mittag, Ross Cutler, Yasaman Hosseinkashi, Michael Revow, Sriram Srinivasan, Naglakshmi Chande, Robert Aichner

20:30 – 21:30 | Speech Synthesis: Multilingual and Cross-lingual approaches
On Improving Code Mixed Speech Synthesis with Mixlingual Grapheme-to-Phoneme Model
Shubham Bansal, Arijit Mukherjee, Sandeepkumar Satpal, Rupesh Mehta

21:45 – 22:45 | Speech Synthesis Paradigms and Methods II
Towards Universal Text-to-Speech
Jingzhou Yang, Lei He

21:45 – 22:45 | Speech Synthesis Paradigms and Methods II
Enhancing Monotonicity for Robust Autoregressive Transformer TTS
Xiangyu Liang, Zhiyong Wu, Runnan Li, Yanqing Liu, Sheng Zhao

21:45 – 22:45 | Speech Synthesis: Prosody and Emotion
Hierarchical Multi-Grained Generative Model for Expressive Speech Synthesis
Yukiya Hono, Kazuna Tsuboi, Kei Sawada, Kei Hashimoto, Keiichiro Oura, Yoshihiko Nankaku, Keiichi Tokuda

21:45 – 22:45 | Speech Synthesis: Prosody and Emotion
GAN-based Data Generation for Speech Emotion Recognition
Sefik Emre Eskimez, Dimitrios Dimitriadis, Robert Gmyr, Kenichi Kumatani

21:45 – 22:45 | Student Events
ISCA-SAC: 7th Students Meet the Experts
Panelist: Sunayana Sitaram

Thursday, October 29

19:15 – 20:15 | Speech Synthesis: Neural Waveform Generation II
An Efficient Subband Linear Prediction for LPCNet-based Neural Synthesis
Yang Cui, Xi Wang, Lei He, Frank Soong

19:15 – 20:15 | ASR neural network architectures and training II
Developing RNN-T Models Surpassing High-Performance Hybrid Models with Customization Capability
Jinyu Li, Rui Zhao, Zhong Meng, Yanqing Liu, Wenning Wei, Sarangarajan Parthasarathy, Vadim Mazalov, Zhenghao Wang, Lei He, Sheng Zhao, Yifan Gong

19:15 – 20:15 | New Trends in self-supervised speech processing
Sequence-level Self-learning with Multiple Hypotheses
Kenichi Kumatani, Dimitrios Dimitriadis, Robert Gmyr, Yashesh Gaur, Sefik Emre Eskimez, Jinyu Li, Michael Zeng

19:15 – 20:15 | Spoken Dialogue System
Discriminative Transfer Learning for Optimizing ASR and Semantic Labeling in Task-oriented Spoken Dialog
Yao Qian, Yu Shi, Michael Zeng

19:15 – 20:15 | Spoken Dialogue System
Datasets and Benchmarks for Task-Oriented Log Dialogue Ranking Task
Xinnuo Xu, Yizhe Zhang, Lars Liden, Sungjin Lee

19:15 – 20:15 | Speech Synthesis: Toward End-to-End Synthesis
MoBoAligner: a Neural Alignment Model for Non-autoregressive TTS with Monotonic Boundary Search
Naihan Li, Shujie Liu, Yanqing Liu, Sheng Zhao, Ming Liu

19:15 – 20:15 | Speech Synthesis: Toward End-to-End Synthesis
MultiSpeech: Multi-Speaker Text to Speech with Transformer
Mingjian Chen, Xu Tan, Yi Ren, Jin Xu, Hao Sun, Sheng Zhao, Tao Qin, Tie-Yan Liu

20:30 – 21:30 | Speech Synthesis: Prosody Modeling
Multi-Reference Neural TTS Stylization with Adversarial Cycle Consistency
Matt Whitehill, Shuang Ma, Daniel McDuff, Yale Song

21:45 – 22:45 | Multilingual and code-switched ASR
Improving Low Resource Code-switched ASR using Augmented Code-switched TTS
Yash Sharma, Basil Abraham, Karan Taneja, Preethi Jyothi

21:45 – 22:45 | ASR neural network architectures II – Transformers
Exploring Transformers for Large-Scale Speech Recognition
Liang Lu, Changliang Liu, Jinyu Li, Yifan Gong