Language modeling means defining a joint probability distribution over a sequence of tokens (words or characters). Consider a sequence of tokens $\{x_1, \ldots, x_T\}$. A language model defines $P(x_1, \ldots, x_T)$, which can be used in many areas of natural language processing.

For example, a language model can significantly improve the accuracy of a speech recognition system. When two words sound the same but have different meanings, a language model can help pick the right one. In Figure 1, the speech recognizer (aka acoustic model) has assigned the same high probability to the words “meet” and “meat”. It is even possible that the speech recognizer assigns a higher probability to “meet” than to “meat”. However, by conditioning the language model on the first three tokens (“I-cooked-some”), the next word could be “fish”, “pasta”, or “meat”, each with a reasonable probability that is higher than the probability of “meet”. To get the final answer, we can simply multiply the two tables of probabilities and normalize the result. Now the word “meat” has a very high relative probability!
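The multiply-and-normalize step can be sketched in a few lines of plain Python. The probabilities below are hypothetical values chosen only for illustration (they are not the numbers from Figure 1), and `combine` is a helper name introduced here.

```python
def combine(acoustic, language_model):
    """Multiply two word-probability tables and renormalize the result."""
    # Only words scored by both models can be compared.
    shared = acoustic.keys() & language_model.keys()
    scores = {w: acoustic[w] * language_model[w] for w in shared}
    total = sum(scores.values())
    return {w: s / total for w, s in scores.items()}


# Hypothetical numbers: the acoustic model cannot separate "meet" from "meat",
# while the language model (conditioned on "I cooked some") strongly prefers food.
acoustic = {"meet": 0.45, "meat": 0.45, "fish": 0.05, "pasta": 0.05}
language_model = {"meet": 0.01, "meat": 0.33, "fish": 0.33, "pasta": 0.33}

combined = combine(acoustic, language_model)
best = max(combined, key=combined.get)
print(best, round(combined[best], 3))
```

With these made-up tables, the tie between the two homophones is broken by the language model alone.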

One family of deep learning models capable of modeling sequential data (such as language) is Recurrent Neural Networks (RNNs). RNNs have recently achieved impressive results on different problems such as language modeling. In this article, we briefly describe RNNs and demonstrate how to code them using the Blocks library on top of Theano.

Consider a sequence of $T$ input elements $x_1, \ldots, x_T$. An RNN models the sequence by applying the same operation recursively. Formally,

$$h_t = f(h_{t-1}, x_t) \tag{1}$$

$$y_t = g(h_t) \tag{2}$$

Here, $h_t$ is the internal hidden representation of the RNN and $y_t$ is the output at the $t^{\text{th}}$ time-step. For the very first time-step, we also have an initial state $h_0$. $f$ and $g$ are two functions that are shared across the time axis. In the simplest case, $f$ and $g$ can each be a linear transformation followed by a non-linearity. There are more complicated forms of $f$ and $g$, such as Long Short-Term Memory (LSTM) and Gated Recurrent Unit (GRU). Here we skip the exact formulations of $f$ and $g$ and use the LSTM as a black box. Now suppose we have $B$ sequences, each of length $T$, such that each time-step is represented by a vector of size $F$. The input can then be seen as a 3D tensor of size $T \times B \times F$, the hidden representation as one of size $T \times B \times F'$, and the output as one of size $T \times B \times F''$.
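To make the recurrence concrete, here is a minimal sketch of the simplest case mentioned above, written in plain Python instead of Blocks or Theano. It assumes the usual form of the recurrence, where $f$ is a linear map of the previous state and the current input followed by `tanh`, and $g$ is a plain linear map. The weight names (`W_xh`, `W_hh`, `W_hy`), the sizes, and the random initialization are illustrative assumptions. It processes a single sequence, not a batch of $B$ sequences.

```python
import math
import random


def matvec(W, v):
    """Multiply a matrix (a list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, v)) for row in W]


def rnn_forward(xs, h0, W_xh, W_hh, W_hy):
    """Run a simple tanh RNN over one sequence xs (T vectors of size F)."""
    h = h0
    hs, ys = [], []
    for x in xs:
        # Hidden update: combine the previous state and the current input.
        pre = [a + b for a, b in zip(matvec(W_xh, x), matvec(W_hh, h))]
        h = [math.tanh(p) for p in pre]
        # Output: a linear read-out of the hidden state.
        ys.append(matvec(W_hy, h))
        hs.append(h)
    return hs, ys


def random_matrix(rows, cols, rng):
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]


rng = random.Random(0)
T, F, H, O = 5, 3, 4, 2  # sequence length, input, hidden, and output sizes
W_xh = random_matrix(H, F, rng)
W_hh = random_matrix(H, H, rng)
W_hy = random_matrix(O, H, rng)
xs = [[rng.uniform(-1, 1) for _ in range(F)] for _ in range(T)]

hs, ys = rnn_forward(xs, [0.0] * H, W_xh, W_hh, W_hy)
print(len(hs), len(hs[0]), len(ys), len(ys[0]))
```

The same three weight matrices are reused at every time-step, which is what "shared across the time axis" means in practice.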

Let’s build a character-level language model that can model the joint probability $P(x_1, \ldots, x_T)$ using the chain rule:

$$P(x_1, \ldots, x_T) = P(x_1)\,P(x_2 \mid x_1)\,P(x_3 \mid x_1, x_2) \cdots P(x_T \mid x_1, \ldots, x_{T-1}) \tag{3}$$
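As a small worked example of the chain rule, the sketch below uses a hypothetical two-character vocabulary and a made-up conditional distribution (`cond_prob`) that repeats the previous character with probability 0.8. Neither comes from the article. The code sums log-probabilities of the conditionals and checks that the resulting joint distribution over all length-3 sequences sums to one.

```python
import itertools
import math

VOCAB = "ab"


def cond_prob(token, prefix):
    """Toy P(token | prefix): repeat the previous token with probability 0.8."""
    if not prefix:
        return 1.0 / len(VOCAB)  # uniform distribution for the first token
    return 0.8 if token == prefix[-1] else 0.2


def joint_log_prob(seq):
    """Chain rule in log space: sum over t of log P(x_t | x_1, ..., x_{t-1})."""
    return sum(math.log(cond_prob(tok, seq[:t])) for t, tok in enumerate(seq))


# P("aab") = P(a) * P(a | a) * P(b | aa) = 0.5 * 0.8 * 0.2
p_aab = math.exp(joint_log_prob("aab"))

# Summing the joint over every length-3 sequence should give 1.
total = sum(
    math.exp(joint_log_prob("".join(s)))
    for s in itertools.product(VOCAB, repeat=3)
)
print(p_aab, total)
```

Working in log space avoids numerical underflow when the sequence is long and the product of many small probabilities would otherwise vanish.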

We can model $P(x_t \mid x_1, \ldots, x_{t-1})$ using an RNN by predicting $x_t$ given $x_1, \ldots, x_{t-1}$. In other words, given a sequence $\{x_1, \ldots, x_T\}$, the input sequence is $\{x_1, \ldots, x_{T-1}\}$ and the target sequence is $\{x_2, \ldots, x_T\}$. To define input and target, we can write:
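A minimal sketch of this shift in plain Python, assuming a character vocabulary built from a short example string. It does not use the symbolic Blocks/Theano variables, and the names `char_to_ix`, `inputs`, and `targets` are illustrative.

```python
text = "hello world"

# Map each character to an integer index (a hypothetical vocabulary).
vocab = sorted(set(text))
char_to_ix = {c: i for i, c in enumerate(vocab)}
encoded = [char_to_ix[c] for c in text]

# The input is every token except the last; the target is the same
# sequence shifted by one position, so targets[t] follows inputs[t].
inputs = encoded[:-1]
targets = encoded[1:]

for x, y in list(zip(inputs, targets))[:3]:
    print(repr(vocab[x]), "->", repr(vocab[y]))
```

Each position of the input is paired with the character that follows it, so the model is trained to predict the next character at every time-step.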

Now, to define the model, we need a linear transformation from the input to the LSTM, and another from the LSTM to the output. To train the model, we use the cross entropy between the model output and the true target:
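The cost itself can be sketched independently of the network. The snippet below is plain Python, not the Blocks cost brick. It assumes the model emits one vector of unnormalized scores (logits) per time-step, applies a softmax, and averages the negative log-probability assigned to the true next character. The function names and example logits are illustrative.

```python
import math


def softmax(logits):
    """Turn unnormalized scores into probabilities (numerically stable)."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(logits_seq, targets):
    """Mean negative log-probability of the target index at each time-step."""
    losses = [
        -math.log(softmax(logits)[t]) for logits, t in zip(logits_seq, targets)
    ]
    return sum(losses) / len(losses)


# A model that is undecided over 4 characters pays log(4) per step.
uniform = cross_entropy([[0.0, 0.0, 0.0, 0.0]] * 3, [0, 1, 2])

# A model that puts most of its mass on the correct character pays much less.
confident = cross_entropy([[5.0, 0.0, 0.0, 0.0]] * 3, [0, 0, 0])
print(uniform, confident)
```

Minimizing this quantity is the same as maximizing the log of the joint probability that the chain rule assigns to the training text.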

Now, assuming that the data is provided to us through a data stream, we can start training by initializing the model and tuning its parameters:
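The overall shape of such a training loop (initialize the parameters, then repeatedly step down the gradient of the cross entropy) can be illustrated with a deliberately tiny stand-in. The sketch below swaps the LSTM for a bigram softmax model, where the next character depends only on the current one, and uses plain stochastic gradient descent in pure Python. The toy text, learning rate, and epoch count are arbitrary assumptions, and none of this is the Blocks main loop.

```python
import math
import random

text = "abababababababab"
vocab = sorted(set(text))
char_to_ix = {c: i for i, c in enumerate(vocab)}
data = [char_to_ix[c] for c in text]
pairs = list(zip(data[:-1], data[1:]))  # (input, target) index pairs
V = len(vocab)

# Initialize the model: W[i][j] is the logit of next token j given token i.
rng = random.Random(0)
W = [[rng.uniform(-0.01, 0.01) for _ in range(V)] for _ in range(V)]


def softmax(logits):
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def mean_loss():
    """Average cross entropy over all (input, target) pairs."""
    return sum(-math.log(softmax(W[x])[y]) for x, y in pairs) / len(pairs)


loss_before = mean_loss()
learning_rate = 0.5
for epoch in range(50):
    rng.shuffle(pairs)
    for x, y in pairs:
        probs = softmax(W[x])
        # Gradient of the cross entropy w.r.t. the logits: probs - one_hot(y).
        for j in range(V):
            W[x][j] -= learning_rate * (probs[j] - (1.0 if j == y else 0.0))
loss_after = mean_loss()
print(loss_before, loss_after)
```

On this alternating text the loss starts near $\log 2$ and shrinks toward zero, since the next character is fully determined by the current one. A recurrent model is needed once the prediction depends on more than the last character.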

After the model is trained, we can condition it on an initial sequence and start generating the next token. We can repeatedly feed the predicted token back into the model to get the next one. We can even just start from the initial state and ask the model to hallucinate! Here is a sample of text generated by a model trained on 96 MB of Wikipedia text (figure adapted from here):
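The feed-back generation loop described above can be sketched as follows. The function `toy_next_probs` is a hypothetical stand-in for a trained model: it returns a next-character distribution given the text so far. In a real setup this would be the trained RNN's softmax output. The seed and length are arbitrary.

```python
import random


def generate(next_probs, seed, length, rng):
    """Extend `seed` by sampling one token at a time and feeding it back in."""
    out = list(seed)
    for _ in range(length):
        probs = next_probs(out)  # dict mapping token -> probability
        tokens = list(probs)
        weights = [probs[t] for t in tokens]
        out.append(rng.choices(tokens, weights=weights, k=1)[0])
    return "".join(out)


def toy_next_probs(prefix):
    """Hypothetical stand-in for a trained model: tends to alternate a and b."""
    if not prefix:
        return {"a": 0.5, "b": 0.5}
    if prefix[-1] == "a":
        return {"a": 0.1, "b": 0.9}
    return {"a": 0.9, "b": 0.1}


sample = generate(toy_next_probs, seed="ab", length=20, rng=random.Random(0))
print(sample)
```

Passing an empty seed corresponds to starting from the initial state and letting the model produce text with no conditioning at all.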

Here is a visualization of the model’s output. The first line is the real data, and the next six lines are the candidates with the highest output probability for each character. The more red a cell is, the higher the probability the model assigns to that character. For example, as soon as the model sees ttp://ww, it is confident that the next character is also a “w” and that the one after it is a “.”. But at this point, there is no further clue about the next character, so the model assigns almost the same probability to all the characters (figure adapted from here):

In this post we learned about language modeling and one of its applications in speech recognition. We also learned how to code a recurrent neural network in order to train such a model. You can find the complete code and experiment with a number of datasets, such as Wikipedia, on GitHub. The code was written by my close friend Eloi Zablocki and me.

### About the author

Mohammad Pezeshk is a master’s student in the LISA lab of Universite de Montreal working under the supervision of Yoshua Bengio and Aaron Courville. He obtained his bachelor’s in computer engineering from Amirkabir University of Technology (Tehran Polytechnic) in July 2014 and then started his master’s in September 2014. His research interests lie in the fields of artificial intelligence, machine learning, probabilistic models, and specifically deep learning.