
While watching your favorite movie or TV show, you have probably struggled at times to make out what the characters are saying, especially when they talk really fast or the show is in a language you don’t know. You quickly turn on subtitles and voila, the problem is solved. But do you know how these subtitles work? Instead of a person writing them, a computer can automatically recognize the characters’ speech and dialogue and generate the script.

However, this is just a trivial example of what computers and neural networks can do in the field of speech understanding and generation. Today, we’re gonna talk about how deep neural networks have improved the ability of our computing systems to understand and generate human speech.

How traditional speech recognition systems work

Traditionally, speech recognition models used classification algorithms to arrive at a distribution over possible phonemes for each frame. These classification algorithms were based on highly specialized features such as Mel-frequency cepstral coefficients (MFCCs). Hidden Markov Models (HMMs) were used in the decoding phase. The HMM was accompanied by a pre-trained language model and was used to find the most likely sequence of phones that could be mapped to output words.
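To make the decoding phase a little more concrete, here is a minimal sketch of Viterbi decoding over an HMM, assuming a toy two-phoneme model for the word "hi". The state names, transition probabilities, and per-frame classifier scores are all made up for illustration, and a real system would work with far more states and with MFCC-based scores.

```python
import math

def viterbi(frame_log_probs, log_trans, log_init):
    """Return the most likely state (phoneme) sequence for a list of frames.

    frame_log_probs: one dict per frame, state -> log score of the frame given the state
    log_trans:       dict of dicts, log P(next_state | state)
    log_init:        dict, log P(first state)
    """
    states = list(log_init)
    # best[s] is the log score of the best path that ends in state s
    best = {s: log_init[s] + frame_log_probs[0][s] for s in states}
    backpointers = []
    for frame in frame_log_probs[1:]:
        new_best, pointers = {}, {}
        for s in states:
            prev = max(states, key=lambda p: best[p] + log_trans[p][s])
            new_best[s] = best[prev] + log_trans[prev][s] + frame[s]
            pointers[s] = prev
        best = new_best
        backpointers.append(pointers)
    # Trace the best path backwards from the best final state
    path = [max(states, key=lambda s: best[s])]
    for pointers in reversed(backpointers):
        path.append(pointers[path[-1]])
    return list(reversed(path))

def logs(probabilities):
    """Convert a dict of probabilities to log probabilities."""
    return {k: math.log(v) for k, v in probabilities.items()}

# Hypothetical numbers: four frames scored by a frame-level classifier
frames = [logs(f) for f in (
    {"h": 0.9, "i": 0.1},
    {"h": 0.8, "i": 0.2},
    {"h": 0.2, "i": 0.8},
    {"h": 0.1, "i": 0.9},
)]
transitions = {"h": logs({"h": 0.6, "i": 0.4}), "i": logs({"h": 0.1, "i": 0.9})}
initial = logs({"h": 0.9, "i": 0.1})

path = viterbi(frames, transitions, initial)
print(path)
```

The decoder works in the log domain so that long utterances do not underflow, and it keeps one backpointer per state and frame so the best path can be recovered at the end.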

With the emergence of deep learning, neural networks were used in many aspects of speech recognition, such as phoneme classification, isolated word recognition, audio-visual speech recognition, audio-visual speaker recognition, and speaker adaptation.

Deep learning drove the development of modern Automatic Speech Recognition (ASR) systems. These ASR systems require separate models, namely an acoustic model (AM), a pronunciation model (PM), and a language model (LM). The AM is typically trained to recognize context-dependent states or phonemes by bootstrapping from an existing model, which is used for alignment. The PM maps the sequences of phonemes produced by the AM into word sequences. Word sequences are scored using an LM trained on large amounts of text data, which estimates the probabilities of word sequences.
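As a rough illustration of how the PM and LM divide the work, the sketch below assumes the AM has already produced one phoneme group per word. The lexicon, the ARPAbet-like phoneme spellings, and the bigram probabilities are hypothetical toy values, and the acoustic scores are left out to keep the example short.

```python
import math
from itertools import product

# Hypothetical pronunciation model: phoneme group -> candidate words
LEXICON = {
    "ay": ["I", "eye"],
    "s ao": ["saw"],
    "t uw": ["two", "too", "to"],
    "k ae t s": ["cats"],
}

# Hypothetical bigram language model probabilities
BIGRAM_PROBS = {
    ("<s>", "I"): 0.2,
    ("<s>", "eye"): 0.001,
    ("I", "saw"): 0.1,
    ("saw", "two"): 0.05,
    ("saw", "to"): 0.01,
    ("two", "cats"): 0.1,
}
UNSEEN_PROB = 1e-4  # crude floor for bigrams the toy LM has never seen

def lm_log_prob(words):
    """Score a word sequence with the toy bigram LM."""
    padded = ["<s>"] + list(words)
    return sum(
        math.log(BIGRAM_PROBS.get(pair, UNSEEN_PROB))
        for pair in zip(padded, padded[1:])
    )

def decode(phoneme_groups):
    """Expand phonemes into candidate words (PM), then keep the best sequence (LM)."""
    candidates = product(*(LEXICON[group] for group in phoneme_groups))
    return list(max(candidates, key=lm_log_prob))

best_words = decode(["ay", "s ao", "t uw", "k ae t s"])
print(best_words)
```

The PM alone cannot choose between homophones such as "two", "too", and "to". It is the LM score that settles on the sequence that reads like real text.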

However, training the components independently added complexity and was suboptimal compared to training all components jointly.

This led the ASR community to develop end-to-end systems, which attempt to learn the separate components of an ASR system jointly as a single system.

A single-system speech recognition model

End-to-end trained neural networks can recognize speech without using an external pronunciation lexicon or a separate language model. End-to-end trained systems can directly map the input acoustic speech signal to word sequences. In such sequence-to-sequence models, the AM, PM, and LM are trained jointly in a single system. Since these models directly predict words, the process of decoding utterances is also greatly simplified. End-to-end ASR systems do not require bootstrapping from decision trees or time alignments generated from a separate system, which makes training such models simpler than training conventional ASR systems.

There are several sequence-to-sequence models, including connectionist temporal classification (CTC), the recurrent neural network (RNN) transducer, and attention-based models.

CTC models are used to train end-to-end systems that directly predict grapheme sequences. This model was proposed by Graves et al. as a way of training end-to-end models without requiring a frame-level alignment of the target labels for a training utterance. This basic CTC model was extended by Graves to include a separate recurrent LM component, in a model referred to as the RNN transducer. The RNN transducer augments the encoder network from the CTC model architecture with a separate recurrent prediction network over the output symbols. Attention-based models are also a type of end-to-end sequence model. These models consist of an encoder network, which maps the input acoustics into a higher-level representation, and an attention-based decoder that predicts the next output symbol based on the previous predictions.
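The reason CTC can do without a frame-level alignment is that many frame-level label sequences map to the same output. Repeated labels are merged, and a special blank symbol is dropped. Here is a minimal sketch of that collapsing rule, along with greedy (best-path) decoding, which is a simple approximation rather than the full CTC decoding procedure. The "_" blank symbol and the function names are choices made for this example.

```python
from itertools import groupby

BLANK = "_"  # stand-in for the CTC blank symbol

def ctc_collapse(frame_labels):
    """Merge consecutive repeats, then drop blanks."""
    merged = [label for label, _ in groupby(frame_labels)]
    return "".join(label for label in merged if label != BLANK)

def ctc_greedy_decode(frame_probs):
    """Pick the most probable label in every frame, then collapse.

    frame_probs: one dict per frame, label -> probability
    """
    best_per_frame = [max(probs, key=probs.get) for probs in frame_probs]
    return ctc_collapse(best_per_frame)

# Eleven frames of per-frame labels collapse to a five-letter word
print(ctc_collapse("hh_e_ll_llo"))

# Hypothetical per-frame output distributions
frame_probs = [
    {"h": 0.7, "i": 0.1, BLANK: 0.2},
    {"h": 0.6, "i": 0.1, BLANK: 0.3},
    {"h": 0.1, "i": 0.2, BLANK: 0.7},
    {"h": 0.1, "i": 0.8, BLANK: 0.1},
]
print(ctc_greedy_decode(frame_probs))
```

Note that the blank is what lets the model emit a genuine double letter. Without a blank between them, the two "l" runs in the first example would merge into one.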

A schematic representation of various sequence-to-sequence modeling approaches

Google’s Listen-Attend-Spell (LAS) end-to-end architecture is one such attention-based model. Their end-to-end system achieves a word error rate (WER) of 5.6%, which corresponds to a 16% relative improvement over a strong conventional system which achieves a 6.7% WER. Additionally, the end-to-end model used to output the initial word hypothesis, before any hypothesis rescoring, is 18 times smaller than the conventional model.
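For readers who want to check numbers like these, WER is the word-level edit distance between the reference transcript and the recognizer output, divided by the number of reference words. The sketch below is one simple way to compute it. The example sentences are made up, and only the 6.7% and 5.6% figures come from the results quoted above.

```python
def word_error_rate(reference, hypothesis):
    """Word-level edit distance divided by the number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    # prev[j] is the edit distance between the reference words seen so far
    # and the first j hypothesis words
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        cur = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cur.append(min(
                prev[j] + 1,                           # deletion
                cur[j - 1] + 1,                        # insertion
                prev[j - 1] + (ref_word != hyp_word),  # substitution or match
            ))
        prev = cur
    return prev[-1] / len(ref)

def relative_improvement(baseline, new):
    """Fraction by which the new error rate improves on the baseline."""
    return (baseline - new) / baseline

wer = word_error_rate("the cat sat on the mat", "the cat sit on mat")
gain = relative_improvement(6.7, 5.6)
print(f"WER: {wer:.3f}, relative improvement: {gain:.1%}")
```

With one substitution and one deletion against six reference words, the toy example has a WER of 2/6. The second calculation shows where the 16% comes from: (6.7 - 5.6) / 6.7 is roughly 0.164.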

These sequence-to-sequence models are comparable with traditional approaches on dictation test sets. However, the traditional models outperform end-to-end systems on voice-search test sets.

Work is ongoing on building optimal models for voice-search tests as well. More work is also expected in building multi-dialect and multilingual systems, so that data for all dialects and languages can be combined to train one network, without the need for a separate AM, PM, and LM for each dialect or language.

Enough with understanding speech. Let’s talk about generating it

Text-to-speech (TTS) conversion, i.e., generating natural-sounding speech from text so that people can converse with machines, has been one of the top research goals in recent times. Deep neural networks have greatly improved the overall development of TTS systems and enhanced the individual pieces of such systems.

In 2012, Google first used Deep Neural Networks (DNNs) instead of Gaussian Mixture Models (GMMs), which were then used as the core technology behind TTS systems. DNNs assessed sounds at every instant in time with increased speech recognition accuracy. Later, better neural network acoustic models were built using CTC and sequence discriminative training techniques based on RNNs. Although blazingly fast and accurate, these TTS systems were largely based on concatenative TTS, where a very large database of short speech fragments was recorded from a single speaker and then recombined to form complete utterances.

This led to the development of parametric TTS, where all the information required to generate the data was stored in the parameters of the model, and the contents and characteristics of the speech were controlled via the inputs to the model. WaveNet further enhanced these parametric models by directly modeling the raw waveform of the audio signal, one sample at a time. WaveNet yielded more natural-sounding speech using raw waveforms and was able to model any kind of audio, including music.
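To get a feel for what "one sample at a time" means, here is a toy autoregressive generation loop. The `toy_next_sample` function is a hypothetical stand-in for the network. It is a simple two-tap recurrence that happens to produce a sine wave, whereas WaveNet itself is a deep neural network. The 16 kHz sample rate and 440 Hz tone are likewise assumptions made for this example.

```python
import math

SAMPLE_RATE_HZ = 16_000  # assumed for this example
TONE_HZ = 440            # assumed for this example
OMEGA = 2 * math.pi * TONE_HZ / SAMPLE_RATE_HZ

def toy_next_sample(history):
    """Hypothetical stand-in for the network: predict x[t] from x[t-1] and x[t-2]."""
    return 2 * math.cos(OMEGA) * history[-1] - history[-2]

def generate(n_samples, seed, predict=toy_next_sample):
    """Autoregressive loop: every new sample is conditioned on the samples before it."""
    samples = list(seed)
    while len(samples) < n_samples:
        samples.append(predict(samples))
    return samples

# Two seed samples are enough to start the recurrence
wave = generate(1_600, seed=[0.0, math.sin(OMEGA)])
print(len(wave), round(wave[100], 4))
```

The loop cannot be parallelized across time, because sample t needs sample t-1 to exist first. That dependency is what makes sample-level generation slow at inference time.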

Baidu then came out with their Deep Voice TTS system, constructed entirely from deep neural networks. Their system was able to perform audio synthesis in real time, giving up to a 400x speedup over previous WaveNet inference implementations.

Google then released Tacotron, an end-to-end generative TTS model that synthesized speech directly from characters. Tacotron was able to achieve a 3.82 mean opinion score (MOS), outperforming the traditional parametric system in terms of speech naturalness. Tacotron was also considerably faster than sample-level autoregressive methods because of its ability to generate speech at the frame level.
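A quick back-of-the-envelope calculation shows why frame-level generation helps. The sketch below counts the sequential steps needed for one second of audio, assuming a 24 kHz sample rate and a 12.5 ms frame hop (300 samples). Both values are assumptions chosen for illustration, not figures taken from the text above.

```python
import math

def autoregressive_steps(duration_s, sample_rate_hz, hop_samples=1):
    """Number of sequential generation steps when each step emits hop_samples samples."""
    n_samples = round(duration_s * sample_rate_hz)
    return math.ceil(n_samples / hop_samples)

SAMPLE_RATE_HZ = 24_000  # assumed sample rate
HOP_SAMPLES = 300        # assumed frame hop: 12.5 ms at 24 kHz

sample_level_steps = autoregressive_steps(1.0, SAMPLE_RATE_HZ)
frame_level_steps = autoregressive_steps(1.0, SAMPLE_RATE_HZ, HOP_SAMPLES)
print(sample_level_steps, frame_level_steps, sample_level_steps // frame_level_steps)
```

Under these assumptions, a frame-level model takes 80 sequential steps for a second of speech where a sample-level model takes 24,000. This counts steps only and says nothing about the cost of each step.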

Most recently, Google released Tacotron 2, which took inspiration from past work on Tacotron and WaveNet. It features a Tacotron-style recurrent sequence-to-sequence feature prediction network that generates mel spectrograms, followed by a modified version of WaveNet that generates time-domain waveform samples conditioned on the generated mel spectrogram frames. The model achieved a MOS of 4.53, compared to a MOS of 4.58 for professionally recorded speech.
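If MOS is new to you, it is simply the average of listener ratings, commonly collected on a 1 to 5 naturalness scale. The sketch below shows the calculation with a rough 95% confidence interval. The ratings are hypothetical values invented for this example and are not data from any of the systems mentioned here. Only the 4.53 and 4.58 figures in the last line come from the text.

```python
import math
from statistics import mean, stdev

def mean_opinion_score(ratings):
    """Average of listener ratings on an assumed 1 to 5 scale."""
    if any(not 1 <= r <= 5 for r in ratings):
        raise ValueError("ratings must be between 1 and 5")
    return mean(ratings)

def confidence_half_width(ratings, z=1.96):
    """Half-width of an approximate 95% confidence interval around the MOS."""
    return z * stdev(ratings) / math.sqrt(len(ratings))

# Hypothetical ratings from eight listeners, for illustration only
hypothetical_ratings = [5, 4, 5, 4, 5, 5, 4, 5]

mos = mean_opinion_score(hypothetical_ratings)
half_width = confidence_half_width(hypothetical_ratings)
print(f"MOS: {mos:.3f} +/- {half_width:.3f}")

# Gap between the two MOS values quoted in the text
gap = round(4.58 - 4.53, 2)
print(gap)
```

With only a handful of raters the confidence interval is wide, which is why a gap as small as 0.05 needs a large listening test to be meaningful.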


Deep neural networks have been a strong force behind the development of end-to-end speech recognition and generation models. Although these end-to-end models compare well against the classical approaches, more work remains to be done.

As of now, end-to-end speech models cannot process speech in real time. Real-time speech processing is a strong requirement for latency-sensitive applications such as voice search. Hence more progress is expected in such areas.

Also, end-to-end models do not give the expected results when evaluated on live production data. They also have difficulty learning proper spellings for rarely used words such as proper nouns, something that is done quite easily when a separate PM is used. More effort will be needed to address these challenges as well.
