AI can now help speak your mind: UC researchers introduce a neural decoder that translates brain signals to natural-sounding speech

In a research published in the Nature journal on Monday, a team of neuroscientists from the University of California, San Francisco, introduced a neural decoder that can synthesize natural-sounding speech based on brain activity.

This research was led by Gopala Anumanchipalli, a speech scientist, and Josh Chartier, a bioengineering graduate student in the Chang lab. It is being developed in the laboratory of Edward Chang, a Neurological Surgery professor at University of California.

Why is this neural decoder being introduced?

There are many cases of people losing their voice because of stroke, traumatic brain injury, or neurodegenerative diseases such as Parkinson’s disease, multiple sclerosis, and amyotrophic lateral sclerosis.

Currently,assistive devices that track very small eye or facial muscle movements to enable people with severe speech disabilities express their thoughts by writing them letter-by-letter, do exist. However, generating text or synthesized speech with such devices is often time consuming, laborious, and error-prone. Another limitation these devices have is that they only permit generating a maximum of 10 words per minute, compared to the 100 to 150 words per minute of natural speech.

This research shows that it is possible to generate a synthesized version of a person’s voice that can be controlled by their brain activity. The researchers believe that in future, this device could be used to enable individuals with severe speech disability to have fluent communication. It could even reproduce some of the “musicality” of the human voice that expresses the speaker’s emotions and personality.

“For the first time, this study demonstrates that we can generate entire spoken sentences based on an individual’s brain activity,” said Chang. “This is an exhilarating proof of principle that with technology that is already within reach, we should be able to build a device that is clinically viable in patients with speech loss.”

How does this system work?

This research is based on another study by Josh Chartier and Gopala K. Anumanchipalli, which shows how the speech centers in our brain choreograph the movements of the lips, jaw, tongue, and other vocal tract components to produce fluent speech.

In this new study, Anumanchipalli and Chartier asked five patients being treated at the UCSF Epilepsy Center to read several sentences aloud. These patients had electrodes implanted into their brains to map the source of their seizures in preparation for neurosurgery. Simultaneously, the researchers recorded activity from a brain region known to be involved in language production.

The researchers used the audio recordings of volunteer’s voice to understand the vocal tract movements needed to produce those sounds. With this detailed map of sound to anatomy in hand, the scientists created a realistic virtual vocal tract for each volunteer that could be controlled by their brain activity.

The system comprised of two neural networks:

A decoder for transforming brain activity patterns produced during speech into movements of the virtual vocal tract.

A synthesizer for converting these vocal tract movements into a synthetic approximation of the volunteer’s voice.

Here’s a video depicting the working of this system:

https://www.youtube.com/watch?v=kbX9FLJ6WKw&feature=youtu.be

The researchers observed that the synthetic speech produced by this system was much better as compared to the synthetic speech directly decoded from the volunteer’s brain activity. The generated sentences were also understandable to hundreds of human listeners in crowdsourced transcription tests conducted on the Amazon Mechanical Turk platform.

The system is still in its early stages. Explaining its limitations, Chartier said, “We still have a ways to go to perfectly mimic spoken language. We’re quite good at synthesizing slower speech sounds like ‘sh’ and ‘z’ as well as maintaining the rhythms and intonations of speech and the speaker’s gender and identity, but some of the more abrupt sounds like ‘b’s and ‘p’s get a bit fuzzy. Still, the levels of accuracy we produced here would be an amazing improvement in real-time communication compared to what’s currently available.”

Read the full report on UCSF’s official website.

OpenAI introduces MuseNet: A deep neural network for generating musical compositions

Interpretation of Functional APIs in Deep Neural Networks by Rowel Atienza

Google open-sources GPipe, a pipeline parallelism Library to scale up Deep Neural Network training