Binaural spatial audio positioning in video calls
Spatially separating voices plays a crucial role in speech intelligibility, speaker identification, and cognitive load in conversations. Voices are naturally separated in in-person conversations, but most video conferencing software mixes them down to one channel. Spatial audio has been shown to improve user experience in audio teleconferencing. However, while voice streams in audio-only calls can be positioned anywhere in three dimensions around a listener’s head without concern for visual stimuli, video calls display the corresponding speakers in a narrow 2D plane on the listener’s screen. Audio/visual mismatch may cause discomfort or disorientation for the listener, while tightly coupling audio and video positions may leave too little spacing between audio streams to deliver the previously reported benefits. Therefore, the ideal stream placement for audio in video call software remains an open question.
To better understand the remote user experience with spatial audio in video calls, we conducted a user study focused on user preference and stream identification as a function of the width of the spatial audio stage. Participants used their laptops and headphones to watch videos simulating video calls between either two or four speakers, with four levels of horizontal spread per video set: no spread (i.e., diotic playback without spatialization), narrow, medium, and wide. Participants rated conditions with greater spatial spread higher for audio-visual correspondence, as well as for their ability and confidence to identify specific audio streams. However, spatialization benefits plateaued for the wider spreads tested, with the four-speaker condition benefiting from a wider audio stage than the two-speaker condition. The results indicate that spreading audio streams spatially in video calls benefits listeners. Feedback from an open-ended post-study questionnaire suggests that some listeners prefer a narrower audio stage that corresponds more strongly with the visuals when there are only two active speakers, while for four speakers some listeners prefer a wider audio stage that may increase intelligibility.
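As a rough illustration of what "horizontal spread" means for stream placement, the sketch below distributes speakers evenly across an audio stage of a given angular width. It is not the study's implementation: the function name `speaker_azimuths`, the even spacing, and the stage widths in `STAGE_WIDTHS_DEG` are all assumptions made for this example, since the abstract does not state the angles used.

```python
def speaker_azimuths(n_speakers: int, stage_width_deg: float) -> list[float]:
    """Return one azimuth per speaker, in degrees, centered on 0.

    Negative values are to the listener's left, positive to the right.
    Even spacing across the stage is an assumption of this sketch.
    """
    if n_speakers < 1:
        raise ValueError("n_speakers must be at least 1")
    if n_speakers == 1 or stage_width_deg == 0:
        # No spread: every stream sits straight ahead (diotic-like placement).
        return [0.0] * n_speakers
    step = stage_width_deg / (n_speakers - 1)
    return [-stage_width_deg / 2 + i * step for i in range(n_speakers)]


# Hypothetical stage widths for the four spread levels; the real values
# used in the study are not given in the abstract.
STAGE_WIDTHS_DEG = {"none": 0.0, "narrow": 20.0, "medium": 60.0, "wide": 120.0}

for label, width in STAGE_WIDTHS_DEG.items():
    print(label, speaker_azimuths(2, width), speaker_azimuths(4, width))
```

In an actual binaural renderer, each azimuth would then select or interpolate a head-related transfer function for that stream; that step is omitted here.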
- Date: