Microsoft at ICASSP 2020 in Barcelona, Spain
May 4, 2020 - May 8, 2020

Location: Virtual

Tuesday, May 5

11:30 – 13:30 CEST

MLSP-P2: Applications in Speech and Audio
Multi-Label Sound Event Retrieval Using A Deep Learning-Based Siamese Structure With A Pairwise Presence Matrix
Jianyu Fan, Eric Nichols, Daniel Tompkins, Ana Elisa Méndez Méndez, Benjamin Elizalde, Philippe Pasquier

11:50 – 12:10 CEST

SPE-L1: End-to-end Speech Recognition I: Streaming
Minimum Latency Training Strategies for Streaming Sequence-to-Sequence ASR
Hirofumi Inaguma, Yashesh Gaur, Liang Lu, Jinyu Li, Yifan Gong

16:30 – 18:30 CEST

SPE-P3: Machine Learning for Speech Synthesis I
Improving Prosody with Linguistic and Bert Derived Features in Multi-Speaker Based Mandarin Chinese Neural TTS
Yujia Xiao, Lei He, Huaiping Ming, Frank K. Soong

17:30 – 17:50 CEST

AUD-L2: Deep Learning for Source Separation
Dual-Path RNN: Efficient Long Sequence Modeling for Time-Domain Single-Channel Speech Separation
Yi Luo, Zhuo Chen, Takuya Yoshioka


Wednesday, May 6

9:00 – 11:00 CEST

AUD-P4: Feedback, Noise, and Reverberation
Joint Beamforming and Reverberation Cancellation Using a Constrained Kalman Filter with Multichannel Linear Prediction
Sahar Hashemgeloogerdi, Sebastian Braun

AUD-P4: Feedback, Noise, and Reverberation
Predicting Word Error Rate for Reverberant Speech
Hannes Gamper, Dimitra Emmanouilidou, Sebastian Braun, Ivan Tashev

SPE-P5: Deep Speaker Recognition Models
Improving Deep CNN Networks with Long Temporal Context for Text-independent Speaker Verification
Yong Zhao, Tianyan Zhou, Zhuo Chen, Jian Wu

9:20 – 9:40 CEST

SPE-L6: Speech Enhancement II: Single Channel
Low-Latency Single Channel Speech Enhancement Using U-Net Convolutional Neural Networks
Ahmet E. Bulut, Kazuhito Koishida

11:30 – 13:30 CEST

SAM-P3: Sparsity, Super-Resolution and Imaging
Low-Rank Toeplitz Matrix Estimation Via Random Ultra-Sparse Rulers
Hannah Lawrence, Jerry Li, Cameron Musco, Christopher Musco

SPE-P8: Robust Speech Recognition
A Practical Two-Stage Training Strategy for Multi-Stream End-to-End Speech Recognition
Ruizhi Li, Gregory Sell, Xiaofei Wang, Shinji Watanabe, Hynek Hermansky

16:30 – 16:50 CEST

IFS-L2: Privacy, Biometrics and Information Security
Privacy-Preserving Phishing Web Page Classification Via Fully Homomorphic Encryption
Edward Chou, Arun Gururajan, Kim Laine, Nitin Kumar Goel, Anna Bertiger, Jack W. Stokes

16:30 – 18:30 CEST

HLT-P1: Spoken Language Understanding and Dialogue I
Fast Domain Adaptation for Goal-Oriented Dialogue Using A Hybrid Generative-Retrieval Transformer
Igor Shalyminov, Alessandro Sordoni, Adam Atkinson, Hannes Schulz

SPE-P9: End-to-end Speech Recognition III: General Topics
Exploring Pre-training with Alignments for RNN Transducer based End-to-End Speech Recognition
Hu Hu, Rui Zhao, Jinyu Li, Liang Lu, Yifan Gong


Thursday, May 7

9:00 – 11:00 CEST

HLT-P2: Speech and Language Analysis
Combining Acoustics, Content and Interaction Features to Find Hot Spots in Meetings
Dave Makhervaks, William Hinthorn, Dimitrios Dimitriadis, Andreas Stolcke

10:20 – 10:40 CEST

AUD-L6: Acoustic Environments and Spatial Audio II
Fast Acoustic Scattering Using Convolutional Neural Networks
Ziqi Fan, Vibhav Vineet, Hannes Gamper, Nikunj Raghuvanshi

10:40 – 11:00 CEST

SPE-L11: Speech Separation and Extraction I: Single Channel
An Online Speaker-Aware Speech Separation Approach Based on Time-Domain Representation
Hui Wang, Yan Song, Zeng-Xi Li, Ian McLoughlin, Li-Rong Dai

11:30 – 13:30 CEST

SPE-P12: Machine Learning for Speech Synthesis II
Improving LPCNET-Based Text-to-Speech with Linear Prediction-Structured Mixture Density Network
Min-Jae Hwang, Eunwoo Song, Ryuichi Yamamoto, Frank Soong, Hong-Goo Kang

SPE-P13: Speech Separation and Extraction III
Continuous Speech Separation: Dataset and Analysis
Zhuo Chen, Takuya Yoshioka, Liang Lu, Tianyan Zhou, Zhong Meng, Yi Luo, Jian Wu, Xiong Xiao, Jinyu Li

12:10 – 12:30 CEST

SPE-L12: Speech Separation and Extraction II: Multi-channel
End-to-End Microphone Permutation and Number Invariant Multi-Channel Speech Separation
Yi Luo, Zhuo Chen, Nima Mesgarani, Takuya Yoshioka

16:30 – 18:30 CEST

MMSP-P3: Multimedia Signal Processing
Supervised Deep Hashing for Efficient Audio Event Retrieval
Arindam Jati, Dimitra Emmanouilidou

MMSP-P3: Multimedia Signal Processing
Multimodal Active Speaker Detection and Virtual Cinematography for Video Conferencing
Ross Cutler, Ramin Mehran, Sam Johnson, Cha Zhang, Adam Kirk, Oliver Whyte, Adarsh Kowdle

SPE-P15: Speech Recognition: Adaptation
L-Vector: Neural Label Embedding for Domain Adaptation
Zhong Meng, Hu Hu, Jinyu Li, Changliang Liu, Yan Huang, Yifan Gong, Chin-Hui Lee

SPE-P15: Speech Recognition: Adaptation
Acoustic Model Adaptation for Presentation Transcription and Intelligent Meeting Assistant Systems
Yan Huang, Yifan Gong

SPE-P15: Speech Recognition: Adaptation
Using Personalized Speech Synthesis and Neural Language Generator for Rapid Speaker Adaptation
Yan Huang, Lei He, Wenning Wei, William Gale, Jinyu Li, Yifan Gong

SS-P1: Signal Processing Education: Trends and Innovations
A Dataset for Measuring Reading Levels in India at Scale
Dolly Agarwal, Jayant Gupchup, Nishant Baghel

17:30 – 17:30 CEST

IDSP-L2: Industry Session on Large-Scale Distributed Learning Strategies
Parallelizing Adam Optimizer with Blockwise Model-Update Filtering
Kai Chen, Haisong Ding, Qiang Huo


Friday, May 8

8:00 – 10:00 CEST

IFS-P1: Information Hiding, Biometrics and Security
Texception: A Character/Word-Level Deep Learning Model for Phishing URL Detection
Farid Tajaddodianfar, Jack W. Stokes, Arun Gururajan

SAM-P6: Detection, Estimation and Classification
Static Visual Spatial Priors For DOA Estimation
Pawel Swietojanski, Ondrej Miksik

SPE-P16: Word Spotting
Adaptation of RNN Transducer with Text-to-Speech Technology for Keyword Spotting
Eva Sharma, Guoli Ye, Wenning Wei, Rui Zhao, Yao Tian, Jian Wu, Lei He, Ed Lin, Yifan Gong

SPE-P17: Speech Enhancement IV
AV(SE)²: Audio-Visual Squeeze-Excite Speech Enhancement
Michael Iuzzolino, Kazuhito Koishida

8:20 – 8:40 CEST

HLT-L2: Language Modeling
Low-bit Quantization of Recurrent Neural Network Language Models Using Alternating Direction Methods of Multipliers
Junhao Xu, Xie Chen, Shoukang Hu, Jianwei Yu, Xunying Liu, Helen Mei-Ling Meng

9:40 – 10:00 CEST

MLSP-L10: Deep Neural Network Structures
Neural Attentive Multiview Machines
Oren Barkan, Ori Katz, Noam Koenigstein

11:45 – 13:45 CEST

AUD-P11: Signal Enhancement and Restoration II
Geometrically Constrained Independent Vector Analysis for Directional Speech Enhancement
Li Li, Kazuhito Koishida

AUD-P11: Signal Enhancement and Restoration II
Weighted Speech Distortion Losses for Neural-Network-Based Real-Time Speech Enhancement
Yangyang Xia, Sebastian Braun, Chandan Reddy, Harishchandra Dubey, Ross Cutler, Ivan Tashev

HLT-P5: Multilingual Processing of Language
Addressing Accent Mismatch in Mandarin-English Code-Switching Speech Recognition
Zhili Tan, Xinghua Fan, Hui Zhu, Ed Lin

IFS-P2: Anonymization, Security and Privacy
Detection of Malicious VBScript Using Static and Dynamic Analysis with Recurrent Deep Learning
Jack W. Stokes, Rakshit Agrawal, Geoff McDonald

SPE-P19: Machine Learning for Speech Synthesis III
ESPNET-TTS: Unified, Reproducible, and Integratable Open Source End-to-End Text-to-Speech Toolkit
Tomoki Hayashi, Ryuichi Yamamoto, Katsuki Inoue, Takenori Yoshimura, Shinji Watanabe, Tomoki Toda, Kazuya Takeda, Yu Zhang, Xu Tan

SPE-P20: Speech Recognition: Acoustic Modelling II
High-Accuracy and Low-Latency Speech Recognition with Two-Head Contextual Layer Trajectory LSTM Model
Jinyu Li, Rui Zhao, Eric Sun, Jeremy Wong, Amit Das, Zhong Meng, Yifan Gong

12:25 – 12:45 CEST

SPE-L16: Speaker Diarization
Speaker Diarization with Session-Level Speaker Embedding Refinement Using Graph Neural Networks
Jixuan Wang, Xiong Xiao, Jian Wu, Ranjani Ramamurthy, Frank Rudzicz, Michael Brudno

13:05 – 13:25 CEST

SPE-L16: Speaker Diarization
A Memory Augmented Architecture for Continuous Speaker Identification in Meetings
Nikolaos Flemotomos, Dimitrios Dimitriadis

15:15 – 17:15 CEST

SPE-P21: Voice Conversion
An Improved Frame-Unit-Selection Based Voice Conversion System Without Parallel Training Data
Feng-Long Xie, Xin-Hui Li, Bo Liu, Yi-Bin Zheng, Li Meng, Li Lu, Frank K. Soong

16:15 – 16:30 CEST

MLSP-L11: Attention Needs
Attentive Item2vec: Neural Attentive User Representations
Oren Barkan, Avi Caciularu, Ori Katz, Noam Koenigstein