Google has been one of the leading forces in the area of text-to-speech (TTS) conversion. The company has leaped further ahead in this domain with the launch of Tacotron 2. The new technique combines WaveNet and the original Tacotron, Google's previous speech generation projects.
WaveNet is a generative model of time-domain waveforms. It produces natural-sounding audio and is already used in some complete TTS systems. However, its inputs require significant domain expertise to produce, since they rely on elaborate text-analysis systems and a detailed pronunciation guide.
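WaveNet's core idea is an autoregressive stack of dilated causal convolutions over raw audio samples. Below is a minimal PyTorch sketch of just that idea; the sizes are illustrative assumptions, and the gated activations, residual/skip connections, and conditioning of the real model are omitted:

```python
import torch
import torch.nn as nn

class DilatedCausalStack(nn.Module):
    """Minimal sketch of WaveNet-style dilated causal convolutions.

    The real model adds gated activations, residual/skip connections and
    conditioning inputs; this only shows how dilation grows the receptive field.
    """

    def __init__(self, channels=32, num_layers=8):
        super().__init__()
        self.input_proj = nn.Conv1d(1, channels, kernel_size=1)
        self.layers = nn.ModuleList(
            nn.Conv1d(channels, channels, kernel_size=2, dilation=2 ** i)
            for i in range(num_layers)  # dilations 1, 2, 4, ... double each layer
        )

    def forward(self, waveform):
        # waveform: (batch, 1, time) raw audio samples
        x = self.input_proj(waveform)
        for conv in self.layers:
            # Left-pad so the convolution is causal: the output at time t
            # depends only on inputs at times <= t.
            pad = conv.dilation[0] * (conv.kernel_size[0] - 1)
            x = torch.relu(conv(nn.functional.pad(x, (pad, 0))))
        return x

# Usage: 4000 samples in, 4000 time steps of features out.
features = DilatedCausalStack()(torch.randn(1, 1, 4000))
```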
Tacotron is a sequence-to-sequence architecture for producing magnitude spectrograms from a sequence of characters, i.e. it synthesizes speech directly from text. It uses a single neural network, trained from data alone, to produce the linguistic and acoustic features. Tacotron relies on the Griffin-Lim algorithm for phase estimation, which produces characteristic artifacts and lower audio fidelity than approaches like WaveNet. So although Tacotron captured patterns of rhythm and sound well, it wasn't really suited for producing a final speech product.
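For reference, Griffin-Lim iteratively estimates the phase that is lost when only a magnitude spectrogram is kept. Here is a minimal sketch using librosa (assuming librosa and its bundled example data are available; the STFT parameters are illustrative, not the ones Tacotron used):

```python
import numpy as np
import librosa

# Load some speech, keep only the STFT magnitude (the phase is thrown away),
# then ask Griffin-Lim to estimate a phase and invert back to a waveform.
y, sr = librosa.load(librosa.example("libri1"), sr=22050)
magnitude = np.abs(librosa.stft(y, n_fft=1024, hop_length=256))

# Iterative phase estimation; more iterations reduce (but never remove)
# the characteristic artifacts mentioned above.
y_hat = librosa.griffinlim(magnitude, n_iter=60, hop_length=256, n_fft=1024)
```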
Tacotron 2 is a combination of the two approaches described above. It features a Tacotron-style recurrent sequence-to-sequence feature prediction network that generates mel spectrograms, followed by a modified version of WaveNet that generates time-domain waveform samples conditioned on the predicted mel spectrogram frames.
Source: https://arxiv.org/pdf/1712.05884.pdf
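Conceptually, inference is a composition of those two networks. The pseudocode below sketches that composition; `feature_net`, `vocoder`, and `char_to_id` are hypothetical placeholders standing in for the paper's components, not actual released code:

```python
import torch

def synthesize(text, feature_net, vocoder, char_to_id):
    """Sketch of Tacotron 2 inference: characters -> mel spectrogram -> waveform.

    feature_net and vocoder stand in for the recurrent seq2seq feature
    prediction network and the modified WaveNet; both are assumptions here.
    """
    # 1. Encode the character sequence as integer IDs.
    char_ids = torch.tensor([[char_to_id[c] for c in text.lower()]])

    # 2. The seq2seq network predicts an 80-channel mel spectrogram,
    #    one frame per decoder step.
    mel_frames = feature_net(char_ids)   # (1, 80, n_frames)

    # 3. The modified WaveNet generates 24 kHz waveform samples
    #    conditioned on those mel frames.
    waveform = vocoder(mel_frames)       # (1, n_samples)
    return waveform
```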
In contrast to the original Tacotron, Tacotron 2 uses simpler building blocks: vanilla LSTM and convolutional layers in the encoder and decoder. In addition, each decoder step corresponds to a single spectrogram frame.
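To illustrate those simpler building blocks, here is a rough PyTorch sketch of an encoder made only of convolutional layers followed by a vanilla bidirectional LSTM; the layer counts and sizes are assumptions for illustration rather than the paper's exact configuration:

```python
import torch
import torch.nn as nn

class Encoder(nn.Module):
    """Character encoder built only from conv layers and a vanilla LSTM."""

    def __init__(self, num_chars=148, embed_dim=512):
        super().__init__()
        self.embedding = nn.Embedding(num_chars, embed_dim)
        # Three 1-D convolutions over the character sequence.
        self.convs = nn.Sequential(*[
            nn.Sequential(
                nn.Conv1d(embed_dim, embed_dim, kernel_size=5, padding=2),
                nn.BatchNorm1d(embed_dim),
                nn.ReLU(),
            )
            for _ in range(3)
        ])
        # A single bidirectional LSTM; each direction gets half the channels.
        self.lstm = nn.LSTM(embed_dim, embed_dim // 2,
                            batch_first=True, bidirectional=True)

    def forward(self, char_ids):
        x = self.embedding(char_ids).transpose(1, 2)  # (batch, embed, time)
        x = self.convs(x).transpose(1, 2)             # (batch, time, embed)
        outputs, _ = self.lstm(x)
        return outputs  # encoder states consumed by the seq2seq decoder
```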
The original WaveNet was conditioned on linguistic features, phoneme durations, and log F0 at a frame rate of 5 ms. However, predicting spectrogram frames spaced this closely leads to significant pronunciation issues. Hence, the WaveNet architecture used in Tacotron 2 works with a 12.5 ms feature spacing, using only 2 upsampling layers in the transposed convolutional network.
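At a 24 kHz sample rate, 12.5 ms of feature spacing corresponds to 24,000 × 0.0125 = 300 audio samples per mel frame, so the conditioning network has to upsample each frame by a factor of 300. The sketch below shows how two transposed convolutions can cover that factor; the 15 × 20 split of the strides and the channel sizes are assumptions for illustration:

```python
import torch
import torch.nn as nn

# Two transposed convolutions whose strides multiply to 300, stretching
# 80-channel mel frames (one every 12.5 ms) to the 24 kHz sample rate.
# kernel_size == stride gives exact, non-overlapping integer upsampling.
upsample = nn.Sequential(
    nn.ConvTranspose1d(80, 80, kernel_size=15, stride=15),
    nn.ConvTranspose1d(80, 80, kernel_size=20, stride=20),
)

mel = torch.randn(1, 80, 100)     # 100 frames = 1.25 s of conditioning features
audio_rate_features = upsample(mel)
print(audio_rate_features.shape)  # torch.Size([1, 80, 30000]) == 100 frames * 300 samples
```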
Here’s how it works:
- Tacotron 2 uses a sequence-to-sequence model optimized for TTS to map a sequence of letters to a sequence of features that encode the audio.
- This sequence of features includes an 80-dimensional audio spectrogram with frames computed every 12.5 milliseconds. These features capture word pronunciations as well as other qualities of human speech such as volume, speed, and pitch (see the feature-extraction sketch after this list).
- Finally, these features are converted to a 24 kHz waveform using a WaveNet-like architecture.
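To make the feature representation concrete, the snippet below computes an 80-band mel spectrogram with a 12.5 ms hop from 24 kHz audio using librosa, i.e. the kind of target the feature prediction network learns to output. The FFT size, the log compression details, and the example clip are assumptions, not the paper's exact recipe:

```python
import numpy as np
import librosa

# Any speech clip will do; librosa's bundled example stands in here.
y, _ = librosa.load(librosa.example("libri1"), sr=24000)

# 80 mel bands, one frame every 12.5 ms (24000 * 0.0125 = 300 samples).
mel = librosa.feature.melspectrogram(
    y=y, sr=24000, n_fft=1200, hop_length=300, n_mels=80
)
log_mel = np.log(np.clip(mel, a_min=1e-5, a_max=None))  # log compression of the mel magnitudes

print(log_mel.shape)  # (80, n_frames): the sequence of features the decoder predicts
```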
The Tacotron 2 system can be trained directly from data without relying on complex feature engineering, and it achieves state-of-the-art sound quality close to that of natural human speech. The model achieves a mean opinion score (MOS) of 4.53, comparable to a MOS of 4.58 for professionally recorded speech. Google has also provided Tacotron 2 audio samples that demonstrate the results of their TTS system.
In the future, Google plans to improve the system so that it can pronounce complex words correctly, generate audio in real time, and be directed to make the generated speech sound happy or sad.
The entire paper is available for reading on arXiv here.