Introducing ESPRESSO, an open-source, PyTorch-based, end-to-end neural automatic speech recognition (ASR) toolkit for distributed training across GPUs

Last week, researchers from the USA and China released a paper titled ESPRESSO: A fast end-to-end neural speech recognition toolkit. In the paper, the researchers introduce ESPRESSO, an open-source, modular, end-to-end neural automatic speech recognition (ASR) toolkit. The toolkit is based on the PyTorch library and FAIRSEQ, the neural machine translation toolkit.

The toolkit supports distributed training across GPUs and computing nodes, as well as decoding approaches commonly employed in ASR, such as look-ahead word-based language model fusion.

ESPRESSO is 4 to 11 times faster at decoding than similar systems such as ESPnet, and it achieves state-of-the-art ASR performance on data sets such as LibriSpeech, WSJ, and Switchboard.

Limitations of ESPnet

ESPnet, an end-to-end speech processing toolkit, has some limitations:

ESPnet's code is not easily extensible and has portability issues due to its mixed dependency on two deep learning frameworks, PyTorch and Chainer.

Its decoder is based on a beam search algorithm that is too slow for quick turnaround of experiments.

To address the above problems, the researchers introduced ESPRESSO. With ESPRESSO it is possible to plug new modules into the system by extending standard PyTorch interfaces.

The research paper reads, “We envision that ESPRESSO could become the foundation for unified speech + text processing systems, and pave the way for future end-to-end speech translation (ST) and text-to-speech synthesis (TTS) systems, ultimately facilitating greater synergy between the ASR and NLP research communities.”

ESPRESSO's design goals

The researchers implemented ESPRESSO with certain design goals in mind. First, they used pure Python / PyTorch to enable modularity and extensibility. To speed up experiments, they implemented parallelization, distributed training, and distributed decoding. They made ESPRESSO compatible with the Kaldi / ESPnet data format so that existing, proven data preparation pipelines can be reused. Finally, they made ESPRESSO interoperable with the existing FAIRSEQ codebase to ease future joint research between speech and NLP.

ESPRESSO’s dataset classes

The speech data for ESPRESSO follows the format used by Kaldi, a speech recognition toolkit, where utterances are stored in the Kaldi-defined SCP format. Following ESPnet, the researchers use 80-dimensional log Mel features along with additional pitch features (83 dimensions for each frame).
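
As a rough illustration of the SCP format, each line of a Kaldi script file maps an utterance ID to the location of its feature matrix, typically an archive path followed by a byte offset. The sketch below parses such lines in plain Python. The utterance IDs and paths are made up, and the parse_scp_line helper is hypothetical rather than part of ESPRESSO or Kaldi.

```python
from typing import NamedTuple, Optional


class ScpEntry(NamedTuple):
    utt_id: str            # utterance identifier
    path: str              # file holding the feature matrix (often a .ark archive)
    offset: Optional[int]  # byte offset into the archive, if present


def parse_scp_line(line: str) -> ScpEntry:
    """Parse one '<utt_id> <path>[:<offset>]' line (hypothetical helper)."""
    utt_id, location = line.strip().split(maxsplit=1)
    path, sep, offset = location.rpartition(":")
    if sep and offset.isdigit():
        return ScpEntry(utt_id, path, int(offset))
    # No trailing ':<offset>' part: the whole location is the path.
    return ScpEntry(utt_id, location, None)


# Made-up example lines, for illustration only.
lines = [
    "utt001 /data/feats/raw_fbank_pitch.1.ark:14",
    "utt002 /data/feats/raw_fbank_pitch.1.ark:52331",
]
entries = [parse_scp_line(line) for line in lines]
```

Because the SCP file only stores pointers, the features themselves can stay on disk until a dataset class decides to read them.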

ESPRESSO also follows FAIRSEQ’s concept of “datasets”, which contain a set of training samples and abstract away details such as shuffling and bucketing. Based on this concept, the researchers created the following dataset classes in ESPRESSO:

data.ScpCachedDataset

This dataset contains the real-valued acoustic features extracted from the speech utterances. A training batch drawn from this dataset is a real-valued tensor of shape [BatchSize × TimeFrameLength × FeatureDims], which is fed to the neural speech encoder. Because the acoustic features are too large to be loaded into memory all at once, the researchers also implemented sharded loading, where the next bulk of features is pre-loaded once the previous bulk has been consumed for training or decoding. This balances the file system’s I/O load against memory usage.
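
The sharded loading idea described above can be sketched as a generator that keeps only one shard of features in memory and loads the next shard when the current one has been consumed. This is a minimal sketch under stated assumptions, not ESPRESSO's actual implementation: iterate_in_shards, fake_loader, and the shard size are all hypothetical, and nested Python lists stand in for tensors.

```python
from typing import Callable, Dict, Iterator, List, Sequence

Features = List[List[float]]  # frames x feature dimensions


def iterate_in_shards(
    utt_ids: Sequence[str],
    load_features: Callable[[str], Features],
    shard_size: int,
) -> Iterator[Features]:
    """Yield features utterance by utterance, holding one shard in memory."""
    for start in range(0, len(utt_ids), shard_size):
        shard_ids = utt_ids[start:start + shard_size]
        # Load the next shard only after the previous one is exhausted.
        shard: Dict[str, Features] = {u: load_features(u) for u in shard_ids}
        for utt_id in shard_ids:
            yield shard.pop(utt_id)  # drop each entry once it is consumed


# Stand-in loader that records calls instead of reading from disk.
load_log: List[str] = []


def fake_loader(utt_id: str) -> Features:
    load_log.append(utt_id)
    return [[0.0] * 83]  # a single 83-dimensional frame


ids = ["utt1", "utt2", "utt3", "utt4", "utt5"]
stream = iterate_in_shards(ids, fake_loader, shard_size=2)
first = next(stream)                # triggers loading of the first shard only
loaded_after_first = list(load_log)
rest = list(stream)                 # remaining shards are loaded on demand
```

A larger shard size means fewer, bigger reads from the file system at the cost of more memory, which is the trade-off the paragraph above refers to.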

data.TokenTextDataset

This dataset contains the gold speech transcripts as text. Each training batch is an integer-valued tensor of shape [BatchSize × SequenceLength].
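
To make the [BatchSize × SequenceLength] shape concrete, the following sketch turns transcripts into padded rows of integer IDs. Character-level tokens, the PAD and UNK indices, and the build_vocab and encode_batch helpers are assumptions chosen for simplicity; they are not taken from ESPRESSO.

```python
from typing import Dict, List

PAD, UNK = 0, 1  # assumed special-token indices


def build_vocab(transcripts: List[str]) -> Dict[str, int]:
    """Assign an integer ID to every character seen in the transcripts."""
    vocab: Dict[str, int] = {}
    for text in transcripts:
        for ch in text:
            if ch not in vocab:
                vocab[ch] = len(vocab) + 2  # IDs 0 and 1 are reserved
    return vocab


def encode_batch(transcripts: List[str], vocab: Dict[str, int]) -> List[List[int]]:
    """Map characters to IDs and right-pad to the longest sequence."""
    encoded = [[vocab.get(ch, UNK) for ch in text] for text in transcripts]
    max_len = max(len(seq) for seq in encoded)
    return [seq + [PAD] * (max_len - len(seq)) for seq in encoded]


transcripts = ["hello", "hi"]
vocab = build_vocab(transcripts)
batch = encode_batch(transcripts, vocab)
shape = (len(batch), len(batch[0]))  # (BatchSize, SequenceLength)
```

Padding is needed because transcripts in one batch rarely have the same length, while a tensor must be rectangular.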

data.SpeechDataset

data.SpeechDataset is a container for the two datasets above. Each sample drawn from this dataset contains two fields, source and target, which point to the speech utterance and the gold transcript, respectively.
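
The container idea can be sketched as a small class that pairs the two underlying datasets by index and returns a sample with source and target fields. The PairedDataset class below is a hypothetical stand-in written for illustration and does not reproduce the interface of data.SpeechDataset.

```python
from typing import Dict, List, Sequence

Features = List[List[float]]  # frames x feature dimensions


class PairedDataset:
    """Hypothetical container pairing acoustic features with transcripts."""

    def __init__(self, sources: Sequence[Features], targets: Sequence[List[int]]):
        if len(sources) != len(targets):
            raise ValueError("sources and targets must have the same length")
        self.sources = sources
        self.targets = targets

    def __len__(self) -> int:
        return len(self.sources)

    def __getitem__(self, index: int) -> Dict[str, object]:
        # Each sample exposes the two fields: source and target.
        return {"source": self.sources[index], "target": self.targets[index]}


# A 4-frame and a 2-frame utterance with 83-dimensional frames.
features = [
    [[0.0] * 83 for _ in range(4)],
    [[0.0] * 83 for _ in range(2)],
]
token_ids = [[2, 3, 4], [5, 6]]
dataset = PairedDataset(features, token_ids)
sample = dataset[1]
```

Keeping the acoustic and text datasets separate and joining them only in the container lets each side use its own loading strategy.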

Achieving state-of-the-art ASR performance on LibriSpeech, WSJ, and Switchboard datasets

ESPRESSO provides running recipes for a variety of data sets. The researchers give details of their recipes for Wall Street Journal (WSJ), an 80-hour English newspaper speech corpus; Switchboard (SWBD), a 300-hour English telephone speech corpus; and LibriSpeech, a corpus of approximately 1,000 hours of English speech.

Each data set has its own extra text corpus that is used for training language models. The models are optimized using Adam, a method for stochastic optimization, with an initial learning rate of 10^-3. This rate is halved if the metric on the validation set at the end of an epoch does not improve over the previous epoch, and training stops once the learning rate falls below 10^-5.
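
The learning rate rule described above can be sketched in a few lines. This is an illustrative simulation rather than the toolkit's training loop: run_schedule is a hypothetical helper, the validation metrics are made up, and lower metric values are assumed to be better.

```python
from typing import List, Optional


def run_schedule(
    val_metrics: List[float],
    initial_lr: float = 1e-3,
    min_lr: float = 1e-5,
    factor: float = 0.5,
) -> List[float]:
    """Return the learning rate used in each epoch until training stops.

    Lower metric values are assumed to be better (e.g. validation loss).
    """
    lr = initial_lr
    history: List[float] = []
    previous: Optional[float] = None
    for metric in val_metrics:
        history.append(lr)
        # Halve the rate when the metric fails to improve on the previous epoch.
        if previous is not None and metric >= previous:
            lr *= factor
        previous = metric
        if lr < min_lr:
            break  # learning rate fell below the threshold: stop training
    return history


# Made-up validation losses: epochs 3 and 5 fail to improve.
lrs = run_schedule([2.0, 1.5, 1.6, 1.4, 1.5])

# A metric that never improves halves the rate every epoch until the stop rule fires.
stalled = run_schedule([1.0] * 20)
```

Starting from 10^-3, seven halvings are enough to push the rate below 10^-5, so a run that stops improving ends after a handful of extra epochs instead of training indefinitely.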

Curriculum learning is used for the initial epochs of the LibriSpeech and WSJ / SWBD recipes, as it prevents training divergence and improves performance. NVIDIA GeForce GTX 1080 Ti GPUs are used for training and evaluating the models. In the paper, all models are trained on 2 GPUs using FAIRSEQ's built-in distributed data parallelism.

To conclude, the researchers have presented the ESPRESSO toolkit in this paper and have provided ASR recipes for the LibriSpeech, WSJ, and Switchboard data sets. The paper reads, “By sharing the underlying infrastructure with FAIRSEQ, we hope ESPRESSO will facilitate future joint research in speech and natural language processing, especially in sequence transduction tasks such as speech translation and speech synthesis.”

To learn more about ESPRESSO, check out the paper.