AI learns to talk naturally with Google’s Tacotron 2

Google has been one of the leading forces in text-to-speech (TTS) conversion. The company has moved further ahead in this domain with the launch of Tacotron 2. The new technique combines WaveNet and the original Tacotron, Google’s previous speech generation projects.

WaveNet is a generative model of time-domain waveforms. It produces natural-sounding, high-fidelity audio and is already used in some complete TTS systems. However, the inputs to WaveNet take significant domain expertise to produce, as they require elaborate text-analysis systems and a detailed pronunciation guide.

Tacotron is a sequence-to-sequence architecture for producing magnitude spectrograms from a sequence of characters, i.e. it synthesizes speech directly from text. It uses a single neural network, trained from data alone, to produce the linguistic and acoustic features. Tacotron uses the Griffin-Lim algorithm for phase estimation. Griffin-Lim produces characteristic artifacts and lower audio fidelity than approaches like WaveNet. Although Tacotron handled patterns of rhythm and sound well, it wasn’t suited for producing a final speech product.

Tacotron 2 combines the two approaches described above. It features a Tacotron-style recurrent sequence-to-sequence feature prediction network that generates mel spectrograms, followed by a modified version of WaveNet that generates time-domain waveform samples conditioned on the generated mel spectrogram frames.

Source: https://arxiv.org/pdf/1712.05884.pdf

In contrast to Tacotron, Tacotron 2 uses simpler building blocks: vanilla LSTM and convolutional layers in the encoder and decoder. Also, each decoder step corresponds to a single spectrogram frame.

The original WaveNet used linguistic features, phoneme durations, and log F0 at a frame rate of 5 ms. However, the authors observed significant pronunciation issues when predicting spectrogram frames spaced this closely. Hence, the WaveNet architecture used in Tacotron 2 works with 12.5 ms feature spacing by using only 2 upsampling layers in the transposed convolutional network.
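As a rough illustration of the upsampling step, the sketch below stacks two transposed 1-D convolutions to stretch frame-level features to the sample rate. Assuming the 24 kHz output rate mentioned in this article, one 12.5 ms frame covers 300 samples. The split of that factor into strides of 15 and 20, the all-ones kernels, and the helper names `transposed_conv1d` and `upsample_frames` are illustrative assumptions, not details taken from the paper.

```python
def transposed_conv1d(x, kernel, stride):
    """Plain 1-D transposed convolution (no padding) over a list of floats."""
    out = [0.0] * ((len(x) - 1) * stride + len(kernel))
    for i, value in enumerate(x):
        for k, weight in enumerate(kernel):
            # Each input value is spread over len(kernel) output positions.
            out[i * stride + k] += value * weight
    return out


# Assumed strides: 15 * 20 = 300 samples per 12.5 ms frame at 24 kHz.
STRIDES = (15, 20)


def upsample_frames(frame_values):
    """Stretch one feature channel from frame rate to sample rate."""
    signal = list(frame_values)
    for stride in STRIDES:
        # An all-ones kernel as wide as the stride simply repeats each value;
        # a trained network would learn these weights instead.
        signal = transposed_conv1d(signal, [1.0] * stride, stride)
    return signal


upsampled = upsample_frames([1.0, 2.0])  # two frames of a single channel
print(len(upsampled))  # 600
```

With only two layers, each one has to cover a large share of the overall factor, which is why the assumed strides here are fairly big.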

Here’s how it works:

  • Tacotron 2 uses a sequence-to-sequence model optimized for TTS in order to map a sequence of letters to a sequence of features that encode the audio.
  • This sequence of features is an 80-dimensional audio spectrogram with frames computed every 12.5 milliseconds. The features capture word pronunciations as well as other qualities of human speech such as volume, speed and pitch.
  • Finally, these features are converted to a 24 kHz waveform using a WaveNet-like architecture.
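
The figures in the steps above imply some simple bookkeeping, sketched below. The helper names are hypothetical; only the 80 dimensions, the 12.5 ms frame spacing, and the 24 kHz output rate come from the article.

```python
N_MEL_CHANNELS = 80      # dimensions per spectrogram frame
FRAME_SHIFT_MS = 12.5    # one frame every 12.5 milliseconds
SAMPLE_RATE_HZ = 24_000  # output waveform rate


def samples_per_frame():
    """Waveform samples that must be generated for each spectrogram frame."""
    return round(SAMPLE_RATE_HZ * FRAME_SHIFT_MS / 1000)


def spectrogram_shape(duration_s):
    """(frames, channels) of the intermediate features for a given duration."""
    n_frames = round(duration_s * 1000 / FRAME_SHIFT_MS)
    return n_frames, N_MEL_CHANNELS


def waveform_length(duration_s):
    """Number of audio samples for the same duration."""
    return spectrogram_shape(duration_s)[0] * samples_per_frame()


print(samples_per_frame())     # 300
print(spectrogram_shape(2.0))  # (160, 80)
print(waveform_length(2.0))    # 48000
```

So two seconds of speech pass through the system as 160 frames of 80 values each before being expanded into 48,000 audio samples.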

The Tacotron 2 system can be trained directly from data without relying on complex feature engineering. It achieves state-of-the-art sound quality close to that of natural human speech: the model achieves a mean opinion score (MOS) of 4.53, comparable to a MOS of 4.58 for professionally recorded speech. Google has also provided some Tacotron 2 audio samples that demonstrate the results of its TTS system.
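
For context, a mean opinion score is the average of listener ratings on a 1-to-5 scale. A minimal sketch of that calculation follows; the ratings in it are made-up placeholders for illustration, not data from the paper, and `mean_opinion_score` is a hypothetical helper.

```python
from statistics import mean


def mean_opinion_score(ratings):
    """Average listener ratings, each expected to lie on the 1-5 MOS scale."""
    if not ratings:
        raise ValueError("at least one rating is required")
    if any(r < 1 or r > 5 for r in ratings):
        raise ValueError("MOS ratings must lie between 1 and 5")
    return mean(ratings)


# Placeholder ratings, invented purely to show the calculation.
example_ratings = [5, 4, 5, 4, 5]
print(mean_opinion_score(example_ratings))  # 4.6

# Gap between the two scores reported in the article.
print(round(4.58 - 4.53, 2))  # 0.05
```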

Going forward, Google aims to improve the system so that it can pronounce complex words, generate audio in real time, and direct generated speech to sound happy or sad.

The full paper is available on arXiv at https://arxiv.org/pdf/1712.05884.pdf.

Sugandha Lahoti

Content Marketing Editor at Packt Hub. I blog about new and upcoming tech trends ranging from Data science, Web development, Programming, Cloud & Networking, IoT, Security and Game development.
