
Google AI engineers introduce Translatotron, an end-to-end speech-to-speech translation model

May 17, 2019 - 6:49 am


Just two days ago, the research team at Google AI introduced Translatotron, an end-to-end speech-to-speech translation model. In their research paper, “Direct speech-to-speech translation with a sequence-to-sequence model”, they demonstrated Translatotron and showed that the model achieves high translation quality on two Spanish-to-English datasets.

Speech-to-speech translation systems have usually been broken into three separate components:

Automatic speech recognition: It is used to transcribe the source speech as text.
Machine translation: It is used to translate the transcribed text into the target language.
Text-to-speech synthesis (TTS): It is used to generate speech in the target language from the translated text.

Dividing the task into such a cascade of systems has worked well and has powered many commercial speech-to-speech translation products, including Google Translate.
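The three-stage cascade described above can be sketched as a simple function composition. This is only an illustrative outline: the function names (`recognize_speech`, `translate_text`, `synthesize_speech`) and the toy lookup tables are hypothetical stand-ins, not the API of any real product.

```python
# Toy cascade: every stage is a hypothetical stand-in for a real model.

# Pretend "audio" is a string tagged with the words it contains.
FAKE_ASR = {"audio:hola mundo": "hola mundo"}   # speech -> source text
FAKE_MT = {"hola mundo": "hello world"}         # source text -> target text


def recognize_speech(audio: str) -> str:
    """Stage 1 (ASR): transcribe the source speech as text."""
    return FAKE_ASR[audio]


def translate_text(text: str) -> str:
    """Stage 2 (MT): translate the transcript into the target language."""
    return FAKE_MT[text]


def synthesize_speech(text: str) -> str:
    """Stage 3 (TTS): generate target-language speech from translated text."""
    return "audio:" + text


def cascade_translate(audio: str) -> str:
    # Each stage consumes the text produced by the previous stage, so a
    # mistake made early on is passed along to the later stages.
    return synthesize_speech(translate_text(recognize_speech(audio)))


result = cascade_translate("audio:hola mundo")
print(result)
```

Note that text is the interface between every pair of stages, which is exactly the intermediate representation an end-to-end model avoids.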

In 2016, interest in end-to-end models for speech translation grew after researchers demonstrated the feasibility of using a single sequence-to-sequence model for speech-to-text translation.

In 2017, the Google AI team demonstrated that such end-to-end models can outperform cascade models. Recently, many approaches for improving end-to-end speech-to-text translation models have been proposed.

Translatotron demonstrates that a single sequence-to-sequence model can directly translate speech from one language into another. Also, it doesn’t rely on an intermediate text representation in either language, as required in cascaded systems. It is based on a sequence-to-sequence network that takes source spectrograms as input and then generates spectrograms of the translated content in the target language.
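To make the term "spectrogram" concrete: it is a time-frequency picture of audio, built by slicing the waveform into short overlapping frames and taking the magnitude of each frame's Fourier transform. A minimal pure-Python sketch follows; the frame length, hop size and Hann window are illustrative assumptions, not the settings used by Translatotron.

```python
import cmath
import math
from typing import List


def magnitude_spectrogram(signal: List[float], frame_len: int = 64,
                          hop: int = 32) -> List[List[float]]:
    """Return one row of DFT magnitudes per frame (bins 0..frame_len // 2)."""
    # A Hann window reduces spectral leakage at the frame edges.
    window = [0.5 - 0.5 * math.cos(2 * math.pi * n / frame_len)
              for n in range(frame_len)]
    frames = []
    for start in range(0, len(signal) - frame_len + 1, hop):
        frame = [signal[start + n] * window[n] for n in range(frame_len)]
        row = []
        for k in range(frame_len // 2 + 1):
            # Naive DFT for bin k; fine for a demo, an FFT is used in practice.
            acc = sum(frame[n] * cmath.exp(-2j * math.pi * k * n / frame_len)
                      for n in range(frame_len))
            row.append(abs(acc))
        frames.append(row)
    return frames


# A pure tone with exactly 8 cycles per 64-sample frame.
tone = [math.sin(2 * math.pi * 8 * n / 64) for n in range(256)]
spec = magnitude_spectrogram(tone)

# The strongest frequency bin of the first frame should match the tone.
peak_bin = max(range(len(spec[0])), key=lambda k: spec[0][k])
print(len(spec), len(spec[0]), peak_bin)
```

Each row of `spec` is one time step and each column one frequency bin; a model of the kind described above would read such a grid for the source speech and emit another one for the target speech.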

Translatotron also makes use of two separately trained components: a neural vocoder that converts output spectrograms to time-domain waveforms and a speaker encoder, which is used to maintain the source speaker’s voice in the synthesized translated speech.
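The data flow at inference time, as described above, can be outlined as follows. Every function here is a hypothetical placeholder that returns dummy values, and passing the speaker vector directly into the translation step is an assumption made for illustration; the sketch only shows which component feeds which, not how any of them is implemented.

```python
from typing import List

Spectrogram = List[List[float]]   # frames x frequency bins
Waveform = List[float]            # time-domain samples


def speaker_encoder(reference: Waveform) -> List[float]:
    """Placeholder: summarise the speaker's voice as a fixed-size vector."""
    mean = sum(reference) / len(reference)
    return [mean, max(reference) - min(reference)]


def seq2seq_translate(source: Spectrogram, speaker: List[float]) -> Spectrogram:
    """Placeholder: map source spectrogram frames to target frames.

    A real model would be a trained network conditioned on `speaker`;
    here the frames are simply copied so that the pipeline can run.
    """
    return [list(frame) for frame in source]


def vocoder(target: Spectrogram) -> Waveform:
    """Placeholder: turn spectrogram frames into time-domain samples."""
    return [sum(frame) / len(frame) for frame in target]


def translate_speech(source: Spectrogram, reference: Waveform) -> Waveform:
    # No text appears anywhere on this path: spectrogram in, waveform out.
    speaker = speaker_encoder(reference)
    return vocoder(seq2seq_translate(source, speaker))


out = translate_speech([[0.0, 1.0], [2.0, 4.0]], [0.1, -0.1, 0.3])
print(out)
```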

During training, the sequence-to-sequence model uses a multitask objective: it predicts source and target transcripts while generating target spectrograms. During inference, however, no transcripts or other intermediate text representations are used.
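One common way to express a multitask objective like this is a weighted sum of the main loss and the auxiliary losses. The sketch below assumes that form; the function name and the weight value are hypothetical and are not taken from the paper.

```python
def multitask_loss(spectrogram_loss: float,
                   source_transcript_loss: float,
                   target_transcript_loss: float,
                   aux_weight: float = 0.5) -> float:
    """Combine the main spectrogram loss with two auxiliary transcript losses.

    `aux_weight` is an illustrative value, not a published hyperparameter.
    """
    auxiliary = source_transcript_loss + target_transcript_loss
    return spectrogram_loss + aux_weight * auxiliary


# Training objective: the transcript predictions contribute to the loss.
train_loss = multitask_loss(2.0, 1.0, 3.0)

# With the auxiliary weight at zero only the spectrogram term remains,
# i.e. the objective the model would have without the transcript tasks.
spectrogram_only = multitask_loss(2.0, 1.0, 3.0, aux_weight=0.0)
print(train_loss, spectrogram_only)
```

The auxiliary terms exist only to shape training; since they are not part of the spectrogram output path, nothing needs to decode text when the model is used for translation.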

The engineers at Google AI validated Translatotron’s translation quality by measuring the BLEU (bilingual evaluation understudy) score, computed with text transcribed by a speech recognition system.
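Because the model outputs audio rather than text, the output is first transcribed and BLEU is then computed on the transcript. The following is a simplified single-reference BLEU (clipped n-gram precisions up to 4-grams, geometric mean, brevity penalty) without smoothing. It is a teaching sketch, not the exact scorer used by the researchers, and `transcribe` is a hypothetical stand-in for a speech recognition system.

```python
import math
from collections import Counter
from typing import List


def ngrams(tokens: List[str], n: int) -> Counter:
    """Count all contiguous n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def bleu(candidate: str, reference: str, max_n: int = 4) -> float:
    cand, ref = candidate.split(), reference.split()
    if not cand:
        return 0.0
    log_precisions = []
    for n in range(1, max_n + 1):
        cand_counts, ref_counts = ngrams(cand, n), ngrams(ref, n)
        total = sum(cand_counts.values())
        # Clip each n-gram count by its count in the reference.
        matched = sum(min(c, ref_counts[g]) for g, c in cand_counts.items())
        if total == 0 or matched == 0:
            return 0.0  # no smoothing: one empty precision zeroes the score
        log_precisions.append(math.log(matched / total))
    # The brevity penalty punishes candidates shorter than the reference.
    bp = 1.0 if len(cand) >= len(ref) else math.exp(1 - len(ref) / len(cand))
    return bp * math.exp(sum(log_precisions) / max_n)


def transcribe(audio: str) -> str:
    """Hypothetical ASR stand-in: the fake 'audio' already carries its text."""
    return audio[len("audio:"):]


reference = "the cat sat on the mat"
score = bleu(transcribe("audio:the cat sat on the mat today"), reference)
print(round(score, 4))
```

One consequence of this setup is that recognition errors in the transcription step can lower the score even when the spoken translation itself is correct.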

The results lag behind a conventional cascade system, but the engineers have demonstrated the feasibility of end-to-end direct speech-to-speech translation.

Translatotron can retain the original speaker’s vocal characteristics in the translated speech by incorporating a speaker encoder network. This makes the translated speech sound more natural and less jarring. According to the Google AI team, in the example audio clips they shared, Translatotron gives a more accurate translation than the baseline cascade model while retaining the original speaker’s vocal characteristics.

The engineers concluded that, to the best of their knowledge, Translatotron is the first end-to-end model that can directly translate speech from one language into speech in another language while retaining the source speaker’s voice in the translated speech.

To learn more, check out the blog post by Google AI.
