Google has developed a new artificial intelligence tool that, it claims, can pick out an individual's voice in a noisy crowd.
Scientists have long believed that humans are adept at listening for the voice of a particular person in a noisy environment, a phenomenon known as the 'cocktail party effect'.
The ability to mentally mute competing voices and sounds in crowded environments comes naturally to humans, but is far harder for electronics.
Although this area has been well studied over the last few years, Google admits that automatic speech separation – separating an audio signal into its individual speech sources – is still a "significant challenge" for machines.
However, this could be set to change. The company claims to have developed what it calls a “deep learning audio-visual model” that it says is capable of “isolating a single speech signal from a mixture of sounds such as other voices and background noise”.
In a blog post, Google software engineers Inbar Mosseri and Oran Lang introduce a new AI method that, they claim, can produce videos in which "speech of specific people is enhanced while all other sounds are suppressed".
“Our method works on ordinary videos with a single audio track, and all that is required from the user is to select the face of the person in the video they want to hear, or to have such a person be selected algorithmically based on context,” they explain.
The researchers said this technology could be used in a plethora of applications, including enhanced speech recognition in videos and improved hearing aids that could be used in “situations where there are multiple people speaking”.
One of the defining aspects of this technique is that it combines the auditory and visual signals of a single video in order to separate speech.
The researchers added: “Intuitively, movements of a person’s mouth, for example, should correlate with the sounds produced as that person is speaking, which in turn can help identify which parts of the audio correspond to that person.”
“The visual signal not only improves the speech separation quality significantly in cases of mixed speech (compared to speech separation using audio alone, as we demonstrate in our paper), but, importantly, it also associates the separated, clean speech tracks with the visible speakers in the video.”
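The idea the engineers describe can be sketched in miniature: a mixture spectrogram and a per-frame embedding of the selected speaker's face are fused to predict a time-frequency mask, and multiplying the mixture by that mask keeps only the bins assigned to that speaker. The sketch below is a conceptual illustration only, not Google's model: the array shapes, the `predict_mask` function, and the random projections standing in for trained networks are all hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-ins (hypothetical shapes): a 100-frame, 64-bin magnitude
# spectrogram of the mixed audio, and a 16-dimensional per-frame visual
# embedding of the selected speaker's face. The real system would derive
# both from learned networks over the audio track and video frames.
mix_spec = rng.random((100, 64))
face_emb = rng.random((100, 16))

def predict_mask(spec, emb):
    """Fuse the audio and visual streams into a time-frequency mask.

    A trained model would learn these projections; here random weights
    merely show the data flow: both streams are mapped into the same
    time-frequency grid, summed, and squashed to (0, 1) with a sigmoid.
    """
    w_audio = rng.random((spec.shape[1], 64))
    w_visual = rng.random((emb.shape[1], 64))
    logits = spec @ w_audio + emb @ w_visual
    return 1.0 / (1.0 + np.exp(-logits))

mask = predict_mask(mix_spec, face_emb)
# Keep only the time-frequency energy the mask attributes to the speaker.
clean_spec = mask * mix_spec
```

Because the mask is conditioned on the face embedding, the separated track is tied to a visible speaker rather than to an anonymous audio source, which is the association the researchers highlight above.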