Speech-to-speech translation systems have usually been broken into three separate components:
- Automatic speech recognition (ASR): It is used to transcribe the source speech as text.
- Machine translation (MT): It is used to translate the transcribed text into the target language.
- Text-to-speech synthesis (TTS): It is used to generate speech in the target language from the translated text.
Dividing the task into such a cascade of systems has worked well and has powered many commercial speech-to-speech translation products, including Google Translate.
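The three-stage cascade described above can be sketched as a simple composition of functions. This is only an illustrative sketch: `make_cascade`, the toy stand-ins, and `TOY_LEXICON` are hypothetical names invented for this example, and the "audio" is just UTF-8 bytes rather than real speech.

```python
from typing import Callable, Dict

def make_cascade(
    asr: Callable[[bytes], str],
    mt: Callable[[str], str],
    tts: Callable[[str], bytes],
) -> Callable[[bytes], bytes]:
    """Chain the three stages; text is the intermediate representation."""
    def translate(source_audio: bytes) -> bytes:
        source_text = asr(source_audio)   # speech -> source-language text
        target_text = mt(source_text)     # source text -> target text
        return tts(target_text)           # target text -> speech
    return translate

# Toy stand-ins (assumptions for illustration, not real models).
TOY_LEXICON: Dict[str, str] = {"hola": "hello", "mundo": "world"}

def toy_asr(audio: bytes) -> str:
    return audio.decode("utf-8")

def toy_mt(text: str) -> str:
    return " ".join(TOY_LEXICON.get(word, word) for word in text.split())

def toy_tts(text: str) -> bytes:
    return text.encode("utf-8")

translate = make_cascade(toy_asr, toy_mt, toy_tts)
```

Because each stage only sees the previous stage's output, an error made by the recognizer is passed on to the translator and then to the synthesizer.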
In 2016, interest in end-to-end models for speech translation grew when researchers demonstrated the feasibility of using a single sequence-to-sequence model for speech-to-text translation.
In 2017, the Google AI team demonstrated that such end-to-end models can outperform cascade models. Recently, many approaches for improving end-to-end speech-to-text translation models have been proposed.
Translatotron demonstrates that a single sequence-to-sequence model can directly translate speech from one language into another. Also, it doesn’t rely on an intermediate text representation in either language, as required in cascaded systems. It is based on a sequence-to-sequence network that takes source spectrograms as input and then generates spectrograms of the translated content in the target language.
Translatotron also makes use of two separately trained components: a neural vocoder that converts output spectrograms to time-domain waveforms and a speaker encoder, which is used to maintain the source speaker’s voice in the synthesized translated speech.
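The data flow described above, spectrogram in, spectrogram out, with a separate vocoder and an optional speaker encoder, might be sketched as follows. This is a structural sketch under stated assumptions, not Google's implementation: `translate_speech` and all the toy components are hypothetical, and the toy "models" are trivial arithmetic chosen only to make the flow runnable.

```python
from typing import Callable, List, Optional

Spectrogram = List[List[float]]   # frames x frequency bins
Waveform = List[float]
SpeakerEmbedding = List[float]

def translate_speech(
    source_spec: Spectrogram,
    seq2seq: Callable[[Spectrogram, Optional[SpeakerEmbedding]], Spectrogram],
    vocoder: Callable[[Spectrogram], Waveform],
    speaker_encoder: Optional[Callable[[Waveform], SpeakerEmbedding]] = None,
    reference_audio: Optional[Waveform] = None,
) -> Waveform:
    """Direct speech-to-speech flow: no text appears at any step."""
    speaker_embedding = None
    if speaker_encoder is not None and reference_audio is not None:
        # Separately trained component that summarizes the speaker's voice.
        speaker_embedding = speaker_encoder(reference_audio)
    target_spec = seq2seq(source_spec, speaker_embedding)
    # Separately trained component: spectrogram -> time-domain waveform.
    return vocoder(target_spec)

# Toy stand-ins (assumptions for illustration only).
def toy_seq2seq(spec, embedding):
    offset = sum(embedding) if embedding else 0.0
    return [[value + offset for value in frame] for frame in reversed(spec)]

def toy_vocoder(spec):
    return [sum(frame) / len(frame) for frame in spec]

def toy_speaker_encoder(audio):
    return [sum(audio) / len(audio)]

spec = [[1.0, 2.0], [3.0, 4.0]]
plain = translate_speech(spec, toy_seq2seq, toy_vocoder)
voiced = translate_speech(
    spec, toy_seq2seq, toy_vocoder, toy_speaker_encoder, [0.5, 0.5]
)
```

Keeping the speaker encoder optional mirrors the idea that voice retention is an add-on to the translation model rather than a requirement for it.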
The sequence-to-sequence model is trained with a multitask objective: alongside generating target spectrograms, it also predicts the source and target transcripts. During inference, however, no transcripts or other intermediate text representations are used.
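A multitask objective of this kind can be sketched as a main spectrogram loss plus weighted auxiliary transcript losses. The specific loss functions (mean squared error and negative log-likelihood), the `aux_weight` parameter, and all function names below are assumptions made for illustration; the article does not specify them.

```python
import math
from typing import List

def spectrogram_mse(pred: List[List[float]], target: List[List[float]]) -> float:
    """Mean squared error over all frames and frequency bins."""
    errors = [
        (p - t) ** 2
        for pred_frame, target_frame in zip(pred, target)
        for p, t in zip(pred_frame, target_frame)
    ]
    return sum(errors) / len(errors)

def transcript_nll(correct_token_probs: List[float]) -> float:
    """Average negative log-likelihood of the correct transcript tokens."""
    return -sum(math.log(p) for p in correct_token_probs) / len(correct_token_probs)

def multitask_loss(
    pred_spec: List[List[float]],
    target_spec: List[List[float]],
    source_token_probs: List[float],
    target_token_probs: List[float],
    aux_weight: float = 1.0,   # hypothetical weighting, not from the article
) -> float:
    main = spectrogram_mse(pred_spec, target_spec)
    auxiliary = transcript_nll(source_token_probs) + transcript_nll(target_token_probs)
    return main + aux_weight * auxiliary

loss = multitask_loss([[1.0, 2.0]], [[1.0, 4.0]], [1.0, 1.0], [0.5])
```

The auxiliary terms only shape training. At inference time the transcript predictions are simply not computed, so the model's path from input to output stays text-free.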
The engineers at Google AI validated Translatotron’s translation quality by measuring the BLEU (bilingual evaluation understudy) score, computed with text transcribed by a speech recognition system.
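Since the model outputs speech rather than text, scoring it with BLEU means transcribing the output first and then comparing that transcript with a reference translation. The sketch below uses a simplified single-reference, sentence-level BLEU without smoothing; it is not the exact scoring setup the team used, and `simple_bleu`, `score_translated_speech`, and the `asr` callable are hypothetical names.

```python
import math
from collections import Counter
from typing import Callable, List

def ngram_counts(tokens: List[str], n: int) -> Counter:
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def simple_bleu(hypothesis: str, reference: str, max_n: int = 4) -> float:
    """Simplified BLEU: clipped n-gram precision with a brevity penalty."""
    hyp = hypothesis.lower().split()
    ref = reference.lower().split()
    if not hyp:
        return 0.0
    log_precisions = []
    for n in range(1, max_n + 1):
        hyp_counts = ngram_counts(hyp, n)
        ref_counts = ngram_counts(ref, n)
        # Clip each hypothesis n-gram count by its count in the reference.
        overlap = sum(min(count, ref_counts[gram]) for gram, count in hyp_counts.items())
        total = sum(hyp_counts.values())
        if total == 0 or overlap == 0:
            return 0.0   # no smoothing in this sketch
        log_precisions.append(math.log(overlap / total))
    # Penalize hypotheses that are shorter than the reference.
    brevity_penalty = 1.0 if len(hyp) >= len(ref) else math.exp(1 - len(ref) / len(hyp))
    return brevity_penalty * math.exp(sum(log_precisions) / max_n)

def score_translated_speech(audio, reference: str, asr: Callable[[object], str]) -> float:
    """Transcribe the translated speech, then score the transcript."""
    return simple_bleu(asr(audio), reference)
```

One consequence of this setup is that recognition errors in the ASR step lower the score even when the spoken translation itself is correct, so the measured BLEU is best read as a lower bound on translation quality.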
The results lag behind a conventional cascade system, but the engineers demonstrated the feasibility of end-to-end direct speech-to-speech translation.
Translatotron can retain the original speaker’s vocal characteristics in the translated speech by incorporating a speaker encoder network. This makes the translated speech sound more natural and less jarring. According to the Google AI team, in some of the examples they shared Translatotron gives a more accurate translation than the baseline cascade model while retaining the original speaker’s vocal characteristics, even though its overall scores remain lower.
The engineers concluded that Translatotron is the first end-to-end model that can directly translate speech from one language into speech in another language and can retain the source speaker’s voice in the translated speech.
To learn more, check out the blog post by Google AI.