Knowledge boosting during low-latency inference
- Tuochao Chen,
- Malek Itani,
- Vidya Srinivas,
- Sefik Emre Eskimez,
- Takuya Yoshioka,
- Shyamnath Gollakota
Interspeech 2024 | Organized by ISCA
Models for low-latency, streaming applications could benefit from the knowledge capacity of larger models, but edge devices cannot run these models due to resource constraints. A possible solution is to transfer hints during inference from a large model running remotely to a small model running on-device. However, this incurs a communication delay that breaks real-time requirements and does not guarantee that both models will operate on the same data at the same time. We propose knowledge boosting, a novel technique that allows a large model to operate on time-delayed input during inference, while still boosting small model performance. Using a streaming neural network that processes 8 ms chunks, we evaluate different speech separation and enhancement tasks with communication delays of up to six chunks, or 48 ms. Our results show larger gains where the performance gap between the small and large models is wide, demonstrating a promising method for large-small model collaboration for low-latency applications. Code, dataset, and audio samples are available at https://knowledgeboosting.cs.washington.edu/
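To make the delayed-hint idea concrete, here is a minimal PyTorch sketch of the inference loop the abstract describes: a small streaming model consumes one chunk at a time while a FIFO queue delays the large model's hints by D chunks, so the large model effectively operates on time-delayed input. This is not the authors' implementation (their code is at the project page above); the chunk size in samples, hint dimension, module architectures, and concatenation-based fusion are all illustrative assumptions.

```python
# Sketch of large-small collaboration with delayed hints. All names, sizes,
# and the fusion scheme are assumptions, not the paper's actual architecture.
import collections
import torch
import torch.nn as nn

CHUNK = 128          # assumed samples per 8 ms chunk at 16 kHz
DELAY_CHUNKS = 6     # communication delay: six chunks = 48 ms

class SmallStreamer(nn.Module):
    """Small on-device model that fuses a (delayed) hint into each chunk."""
    def __init__(self, dim=64, hint_dim=128):
        super().__init__()
        self.encode = nn.Linear(CHUNK, dim)
        self.fuse = nn.Linear(dim + hint_dim, dim)   # assumed fusion: concat + linear
        self.rnn = nn.GRUCell(dim, dim)              # causal, chunk-by-chunk state
        self.decode = nn.Linear(dim, CHUNK)

    def forward(self, chunk, hint, state):
        x = self.encode(chunk)
        x = self.fuse(torch.cat([x, hint], dim=-1))
        state = self.rnn(x, state)
        return self.decode(state), state

class LargeHinter(nn.Module):
    """Stand-in for the remote large model; emits one hint per chunk."""
    def __init__(self, hint_dim=128):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(CHUNK, 512), nn.ReLU(),
                                 nn.Linear(512, hint_dim))

    def forward(self, chunk):
        return self.net(chunk)

small, large = SmallStreamer(), LargeHinter()
state = torch.zeros(1, 64)
# The FIFO models the round-trip delay: a hint computed on chunk t is only
# available to the small model at chunk t + DELAY_CHUNKS.
hint_queue = collections.deque([torch.zeros(1, 128)] * DELAY_CHUNKS)

stream = torch.randn(1, 20 * CHUNK)  # placeholder audio stream
for t in range(20):
    chunk = stream[:, t * CHUNK:(t + 1) * CHUNK]
    hint_queue.append(large(chunk))      # "sent" now, consumed D chunks later
    delayed_hint = hint_queue.popleft()  # hint derived from chunk t - D
    out, state = small(chunk, delayed_hint, state)
```

The key point the sketch captures is that the small model never blocks on the network: it always consumes whatever hint has already arrived, so its per-chunk latency stays fixed regardless of the communication delay.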