Tuesday, May 5
11:30 – 13:30 CEST
MLSP-P2: Applications in Speech and Audio
Multi-Label Sound Event Retrieval Using A Deep Learning-Based Siamese Structure With A Pairwise Presence Matrix
Jianyu Fan, Eric Nichols, Daniel Tompkins, Ana Elisa Méndez Méndez, Benjamin Elizalde, Philippe Pasquier
11:50 – 12:10 CEST
SPE-L1: End-to-end Speech Recognition I: Streaming
Minimum Latency Training Strategies for Streaming Sequence-to-Sequence ASR
Hirofumi Inaguma, Yashesh Gaur, Liang Lu, Jinyu Li, Yifan Gong
16:30 – 18:30 CEST
SPE-P3: Machine Learning for Speech Synthesis I
Improving Prosody with Linguistic and BERT Derived Features in Multi-Speaker Based Mandarin Chinese Neural TTS
Yujia Xiao, Lei He, Huaiping Ming, Frank K. Soong
17:30 – 17:50 CEST
AUD-L2: Deep Learning for Source Separation
Dual-Path RNN: Efficient Long Sequence Modeling for Time-Domain Single-Channel Speech Separation
Yi Luo, Zhuo Chen, Takuya Yoshioka
Wednesday, May 6
9:00 – 11:00 CEST
AUD-P4: Feedback, Noise, and Reverberation
Joint Beamforming and Reverberation Cancellation Using a Constrained Kalman Filter with Multichannel Linear Prediction
Sahar Hashemgeloogerdi, Sebastian Braun
AUD-P4: Feedback, Noise, and Reverberation
Predicting Word Error Rate for Reverberant Speech
Hannes Gamper, Dimitra Emmanouilidou, Sebastian Braun, Ivan Tashev
SPE-P5: Deep Speaker Recognition Models
Improving Deep CNN Networks with Long Temporal Context for Text-independent Speaker Verification
Yong Zhao, Tianyan Zhou, Zhuo Chen, Jian Wu
9:20 – 9:40 CEST
SPE-L6: Speech Enhancement II: Single Channel
Low-Latency Single Channel Speech Enhancement Using U-Net Convolutional Neural Networks
Ahmet E. Bulut, Kazuhito Koishida
11:30 – 13:30 CEST
SAM-P3: Sparsity, Super-Resolution and Imaging
Low-Rank Toeplitz Matrix Estimation Via Random Ultra-Sparse Rulers
Hannah Lawrence, Jerry Li, Cameron Musco, Christopher Musco
SPE-P8: Robust Speech Recognition
A Practical Two-Stage Training Strategy for Multi-Stream End-to-End Speech Recognition
Ruizhi Li, Gregory Sell, Xiaofei Wang, Shinji Watanabe, Hynek Hermansky
16:30 – 16:50 CEST
IFS-L2: Privacy, Biometrics and Information Security
Privacy-Preserving Phishing Web Page Classification Via Fully Homomorphic Encryption
Edward Chou, Arun Gururajan, Kim Laine, Nitin Kumar Goel, Anna Bertiger, Jack W. Stokes
16:30 – 18:30 CEST
HLT-P1: Spoken Language Understanding and Dialogue I
Fast Domain Adaptation for Goal-Oriented Dialogue Using A Hybrid Generative-Retrieval Transformer
Igor Shalyminov, Alessandro Sordoni, Adam Atkinson, Hannes Schulz
SPE-P9: End-to-end Speech Recognition III: General Topics
Exploring Pre-training with Alignments for RNN Transducer based End-to-End Speech Recognition
Hu Hu, Rui Zhao, Jinyu Li, Liang Lu, Yifan Gong
Thursday, May 7
9:00 – 11:00 CEST
HLT-P2: Speech and Language Analysis
Combining Acoustics, Content and Interaction Features to Find Hot Spots in Meetings
Dave Makhervaks, William Hinthorn, Dimitrios Dimitriadis, Andreas Stolcke
10:20 – 10:40 CEST
AUD-L6: Acoustic Environments and Spatial Audio II
Fast Acoustic Scattering Using Convolutional Neural Networks
Ziqi Fan, Vibhav Vineet, Hannes Gamper, Nikunj Raghuvanshi
10:40 – 11:00 CEST
SPE-L11: Speech Separation and Extraction I: Single Channel
An Online Speaker-Aware Speech Separation Approach Based on Time-Domain Representation
Hui Wang, Yan Song, Zeng-Xi Li, Ian McLoughlin, Li-Rong Dai
11:30 – 13:30 CEST
SPE-P12: Machine Learning for Speech Synthesis II
Improving LPCNET-Based Text-to-Speech with Linear Prediction-Structured Mixture Density Network
Min-Jae Hwang, Eunwoo Song, Ryuichi Yamamoto, Frank Soong, Hong-Goo Kang
SPE-P13: Speech Separation and Extraction III
Continuous Speech Separation: Dataset and Analysis
Zhuo Chen, Takuya Yoshioka, Liang Lu, Tianyan Zhou, Zhong Meng, Yi Luo, Jian Wu, Xiong Xiao, Jinyu Li
12:10 – 12:30 CEST
SPE-L12: Speech Separation and Extraction II: Multi-channel
End-to-End Microphone Permutation and Number Invariant Multi-Channel Speech Separation
Yi Luo, Zhuo Chen, Nima Mesgarani, Takuya Yoshioka
16:30 – 18:30 CEST
MMSP-P3: Multimedia Signal Processing
Supervised Deep Hashing for Efficient Audio Event Retrieval
Arindam Jati, Dimitra Emmanouilidou
MMSP-P3: Multimedia Signal Processing
Multimodal Active Speaker Detection and Virtual Cinematography for Video Conferencing
Ross Cutler, Ramin Mehran, Sam Johnson, Cha Zhang, Adam Kirk, Oliver Whyte, Adarsh Kowdle
SPE-P15: Speech Recognition: Adaptation
L-Vector: Neural Label Embedding for Domain Adaptation
Zhong Meng, Hu Hu, Jinyu Li, Changliang Liu, Yan Huang, Yifan Gong, Chin-Hui Lee
SPE-P15: Speech Recognition: Adaptation
Acoustic Model Adaptation for Presentation Transcription and Intelligent Meeting Assistant Systems
Yan Huang, Yifan Gong
SPE-P15: Speech Recognition: Adaptation
Using Personalized Speech Synthesis and Neural Language Generator for Rapid Speaker Adaptation
Yan Huang, Lei He, Wenning Wei, William Gale, Jinyu Li, Yifan Gong
SS-P1: Signal Processing Education: Trends and Innovations
A Dataset for Measuring Reading Levels in India at Scale
Dolly Agarwal, Jayant Gupchup, Nishant Baghel
17:30 – 17:30 CEST
IDSP-L2: Industry Session on Large-Scale Distributed Learning Strategies
Parallelizing Adam Optimizer with Blockwise Model-Update Filtering
Kai Chen, Haisong Ding, Qiang Huo
Friday, May 8
8:00 – 10:00 CEST
IFS-P1: Information Hiding, Biometrics and Security
Texception: A Character/Word-Level Deep Learning Model for Phishing URL Detection
Farid Tajaddodianfar, Jack W. Stokes, Arun Gururajan
SAM-P6: Detection, Estimation and Classification
Static Visual Spatial Priors For DOA Estimation
Pawel Swietojanski, Ondrej Miksik
SPE-P16: Word Spotting
Adaptation of RNN Transducer with Text-to-Speech Technology for Keyword Spotting
Eva Sharma, Guoli Ye, Wenning Wei, Rui Zhao, Yao Tian, Jian Wu, Lei He, Ed Lin, Yifan Gong
SPE-P17: Speech Enhancement IV
AV(SE)²: Audio-Visual Squeeze-Excite Speech Enhancement
Michael Iuzzolino, Kazuhito Koishida
8:20 – 8:40 CEST
HLT-L2: Language Modeling
Low-bit Quantization of Recurrent Neural Network Language Models Using Alternating Direction Methods of Multipliers
Junhao Xu, Xie Chen, Shoukang Hu, Jianwei Yu, Xunying Liu, Helen Mei-Ling Meng
9:40 – 10:00 CEST
MLSP-L10: Deep Neural Network Structures
Neural Attentive Multiview Machines
Oren Barkan, Ori Katz, Noam Koenigstein
11:45 – 13:45 CEST
AUD-P11: Signal Enhancement and Restoration II
Geometrically Constrained Independent Vector Analysis for Directional Speech Enhancement
Li Li, Kazuhito Koishida
AUD-P11: Signal Enhancement and Restoration II
Weighted Speech Distortion Losses for Neural-Network-Based Real-Time Speech Enhancement
Yangyang Xia, Sebastian Braun, Chandan Reddy, Harishchandra Dubey, Ross Cutler, Ivan Tashev
HLT-P5: Multilingual Processing of Language
Addressing Accent Mismatch in Mandarin-English Code-Switching Speech Recognition
Zhili Tan, Xinghua Fan, Hui Zhu, Ed Lin
IFS-P2: Anonymization, Security and Privacy
Detection of Malicious VBScript Using Static and Dynamic Analysis with Recurrent Deep Learning
Jack W. Stokes, Rakshit Agrawal, Geoff McDonald
SPE-P19: Machine Learning for Speech Synthesis III
ESPNET-TTS: Unified, Reproducible, and Integratable Open Source End-to-End Text-to-Speech Toolkit
Tomoki Hayashi, Ryuichi Yamamoto, Katsuki Inoue, Takenori Yoshimura, Shinji Watanabe, Tomoki Toda, Kazuya Takeda, Yu Zhang, Xu Tan
SPE-P20: Speech Recognition: Acoustic Modelling II
High-Accuracy and Low-Latency Speech Recognition with Two-Head Contextual Layer Trajectory LSTM Model
Jinyu Li, Rui Zhao, Eric Sun, Jeremy Wong, Amit Das, Zhong Meng, Yifan Gong
12:25 – 12:45 CEST
SPE-L16: Speaker Diarization
Speaker Diarization with Session-Level Speaker Embedding Refinement Using Graph Neural Networks
Jixuan Wang, Xiong Xiao, Jian Wu, Ranjani Ramamurthy, Frank Rudzicz, Michael Brudno
13:05 – 13:25 CEST
SPE-L16: Speaker Diarization
A Memory Augmented Architecture for Continuous Speaker Identification in Meetings
Nikolaos Flemotomos, Dimitrios Dimitriadis
15:15 – 17:15 CEST
SPE-P21: Voice Conversion
An Improved Frame-Unit-Selection Based Voice Conversion System Without Parallel Training Data
Feng-Long Xie, Xin-Hui Li, Bo Liu, Yi-Bin Zheng, Li Meng, Li Lu, Frank K. Soong
16:15 – 16:30 CEST
MLSP-L11: Attention Needs
Attentive Item2vec: Neural Attentive User Representations
Oren Barkan, Avi Caciularu, Ori Katz, Noam Koenigstein