Researchers successfully develop AI headphones that can translate multiple speakers at the same time

Google's Pixel Buds wireless earbuds have long offered impressive real-time translation capabilities, and brands like Timekettle have launched similar earbuds for business customers over the past few years. However, all of these solutions share a common limitation: they can only process one audio stream at a time for translation.

A team at the University of Washington (UW) has developed a device that can overcome this drawback: AI headphones that can translate the voices of multiple people at once. Imagine a multilingual interpreter in a crowded bar, able to understand the speech of people around them even if they speak different languages, all at the same time.

The team calls their invention Spatial Speech Translation, and it works on binaural headphones. For those unfamiliar, binaural audio recreates sound the way the human ear naturally perceives it. To capture this effect, microphones are placed on a mannequin's head, spaced about the same distance apart as two human ears.

This approach matters because the human ear not only hears sound but also helps determine the direction it is coming from. The overall goal is to create a natural soundstage with a stereo effect, giving the feeling of listening to a live concert; in modern contexts, this is called spatial listening. The work comes from a research team led by Professor Shyam Gollakota.
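To make that directional cue concrete: the main clue the two ears provide is the tiny difference in arrival time between them. The snippet below is a minimal illustrative sketch, not the UW team's code, showing how a source's angle can be estimated from that interaural time difference; the ear spacing, sample rate, and function name are assumptions for the example.

```python
import numpy as np

SPEED_OF_SOUND = 343.0   # m/s
EAR_SPACING = 0.18       # metres between the two microphones (assumed value)

def estimate_azimuth(left: np.ndarray, right: np.ndarray, sample_rate: int) -> float:
    """Rough azimuth in degrees: 0 is straight ahead, positive is to the right."""
    # The ear nearer the speaker hears the sound a fraction of a millisecond
    # earlier; the lag of the cross-correlation peak recovers that delay.
    corr = np.correlate(left, right, mode="full")
    lag_samples = np.argmax(corr) - (len(right) - 1)
    itd = lag_samples / sample_rate                 # interaural time difference, seconds
    max_itd = EAR_SPACING / SPEED_OF_SOUND          # largest physically possible ITD
    itd = float(np.clip(itd, -max_itd, max_itd))
    return float(np.degrees(np.arcsin(itd / max_itd)))

# Hypothetical usage with a short 16 kHz stereo snippet:
# angle = estimate_azimuth(left_channel, right_channel, 16_000)
```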

Multi-speaker translation mechanism

'For the first time, we preserved the timbre of each person's voice and the direction in which the sound came from,' explains Gollakota, a professor in the UW's Paul G. Allen School of Computer Science & Engineering.

The team compares their system to radar: it starts by determining how many people are speaking in the environment and updates that number in real time as people move in and out of hearing range. The entire process works on the device, without sending speech data to a cloud server for translation — ensuring privacy.
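As a rough illustration of that radar-like counting, the sketch below (purely hypothetical, not the UW implementation) keeps a live tally of speakers by grouping per-frame direction estimates, such as those produced by the ITD sketch above, and dropping anyone who has not been heard recently. The tolerance and expiry thresholds are made-up values.

```python
import time

MATCH_TOLERANCE_DEG = 15.0   # directions this close are treated as the same speaker
EXPIRY_SECONDS = 2.0         # forget a speaker who has been silent this long

class SpeakerTracker:
    """Keep a running count of distinct voices based on where they come from."""

    def __init__(self):
        self._speakers = {}   # speaker id -> (last azimuth in degrees, last time heard)
        self._next_id = 0

    def update(self, detected_azimuths, now=None):
        """Feed the directions detected in the current audio frame; return the live count."""
        now = time.monotonic() if now is None else now
        for azimuth in detected_azimuths:
            # Match the detection to a known speaker at a similar direction, or register a new one.
            match = next((sid for sid, (az, _) in self._speakers.items()
                          if abs(az - azimuth) <= MATCH_TOLERANCE_DEG), None)
            if match is None:
                match = self._next_id
                self._next_id += 1
            self._speakers[match] = (azimuth, now)
        # Drop speakers who have moved out of hearing range (no recent detections).
        self._speakers = {sid: v for sid, v in self._speakers.items()
                          if now - v[1] <= EXPIRY_SECONDS}
        return len(self._speakers)

# Hypothetical usage: two voices detected at roughly -40 and +25 degrees.
# tracker = SpeakerTracker()
# active = tracker.update([-40.0, 25.0])   # -> 2
```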

In addition to translating speech, the system 'maintains the expression and volume of each speaker.' It also adjusts the direction and intensity of the sound as the speaker moves around the room. Notably, Apple is also said to be developing a system that would allow AirPods to translate audio in real time.
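One simple way to picture that directional playback is basic binaural panning: weight the two ears differently and delay the far ear slightly. The toy sketch below assumes that approach; the real system's rendering is far more sophisticated, and the ear spacing, gain law, and function name here are illustrative only.

```python
import numpy as np

SPEED_OF_SOUND = 343.0   # m/s
EAR_SPACING = 0.18       # metres, assumed

def spatialize(mono: np.ndarray, azimuth_deg: float, sample_rate: int) -> np.ndarray:
    """Return an (N, 2) stereo array with the mono voice placed at azimuth_deg."""
    az = np.radians(azimuth_deg)
    # Equal-power panning: a source to the right is louder in the right ear.
    right_gain = np.sqrt((1 + np.sin(az)) / 2)
    left_gain = np.sqrt((1 - np.sin(az)) / 2)
    # Delay the far ear by the interaural time difference for that angle.
    delay = int(round(abs(EAR_SPACING * np.sin(az)) / SPEED_OF_SOUND * sample_rate))
    delayed = np.concatenate([np.zeros(delay), mono])[: len(mono)]
    if azimuth_deg >= 0:       # source on the right: the left ear is the far ear
        left, right = delayed * left_gain, mono * right_gain
    else:                      # source on the left: the right ear is the far ear
        left, right = mono * left_gain, delayed * right_gain
    return np.stack([left, right], axis=1)
```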

How is the system created?

The UW team tested the AI headphones' translation capabilities in nearly a dozen indoor and outdoor environments. In terms of performance, the system was able to receive, process, and produce translated audio within 2-4 seconds. Testers preferred a latency of around 3-4 seconds, but the team is working to speed up the translation process.

For now, the team has only tested Spanish, German, and French translations, but they hope to add more languages. Technically, they have integrated blind source separation, localization, real-time expressive translation, and binaural rendering into a single workflow, which is a remarkable feat.
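To show how those four stages might fit into one workflow, here is an illustrative outline; the interfaces and names are hypothetical stand-ins, not the team's published components.

```python
from typing import List, Protocol

class Separator(Protocol):
    def separate(self, stereo_chunk: bytes) -> List[bytes]: ...   # blind source separation

class Localizer(Protocol):
    def estimate(self, voice: bytes) -> float: ...                # azimuth of that voice, degrees

class Translator(Protocol):
    def translate(self, voice: bytes) -> bytes: ...               # speech-to-speech translation

class Renderer(Protocol):
    def spatialize(self, voice: bytes, azimuth: float) -> bytes: ...
    def mix(self, voices: List[bytes]) -> bytes: ...

def process_chunk(chunk: bytes, separate: Separator, locate: Localizer,
                  translate: Translator, render: Renderer) -> bytes:
    """One pass through the separation -> localization -> translation -> rendering chain."""
    rendered = []
    for voice in separate.separate(chunk):
        azimuth = locate.estimate(voice)            # remember where this voice came from
        translated = translate.translate(voice)     # translate it while keeping its character
        rendered.append(render.spatialize(translated, azimuth))
    return render.mix(rendered)
```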

On the hardware side, the team's real-time speech translation model runs on the Apple M2 chip, which proved fast enough for on-device inference. Recording was handled by Sony WH-1000XM4 noise-canceling headphones and a Sonic Presence SP15C stereo USB microphone.

Notably, 'The source code for the prototype device has been made open source for continued development by the community,' the UW press release said. This means the scientific and open-source communities can learn from and build on the foundation the UW team has established for more advanced projects.

Update 15 May 2025