LSTMs are widely used for tasks such as text generation and image caption generation. For example, language modeling is useful for text summarization or for generating captivating textual advertisements for products, whereas image caption generation (image annotation) is useful for image retrieval, where a user might need to retrieve images representing some concept (for example, a cat).
In this tutorial, we will implement an LSTM which will generate new stories after training on a dataset of folk stories.
This article is extracted from the book Natural Language Processing with TensorFlow by Thushan Ganegedara.
The application that we will cover in this article is the use of an LSTM to generate new text. For this task, we will download translations of some folk stories by the Brothers Grimm. We will use these stories to train an LSTM and ask it at the end to output a fresh new story. We will process the text by breaking it into character-level bigrams (n-grams, where n=2) and make a vocabulary out of the unique bigrams.
The code for this article is available on GitHub.
First, we will discuss the data we will use for text generation and various preprocessing steps employed to clean data.
About the dataset
We will first look at the dataset so that, when we see the generated text, we can assess whether it makes sense given the training data. We will download the first 100 stories from the website https://www.cs.cmu.edu/~spok/grimmtmp/. These are English translations of stories written in German by the Brothers Grimm.
We download the stories from the website with an automated script, as shown here:
import os
from urllib.request import urlretrieve

url = 'https://www.cs.cmu.edu/~spok/grimmtmp/'

# Create a directory if needed
dir_name = 'stories'
if not os.path.exists(dir_name):
    os.mkdir(dir_name)

def maybe_download(filename):
    """Download a file if not present"""
    filepath = os.path.join(dir_name, filename)
    if not os.path.exists(filepath):
        print('Downloading file: ', filepath)
        urlretrieve(url + filename, filepath)
    else:
        print('File ', filepath, ' already exists.')
    return filepath

num_files = 100
filenames = [format(i, '03d') + '.txt' for i in range(1, num_files + 1)]
for fn in filenames:
    maybe_download(fn)
We will now show example text snippets extracted from two randomly picked stories.
The following is the first snippet:
Then she said, my dearest benjamin, your father has had these coffins made for you and for your eleven brothers, for if I bring a little girl into the world, you are all to be killed and buried in them. And as she wept while she was saying this, the son comforted her and said, weep not, dear mother, we will save ourselves, and go hence…
The second text snippet is as follows:
Red-cap did not know what a wicked creature he was, and was not at all afraid of him.
“Good-day, little red-cap,” said he.
“Thank you kindly, wolf.”
“Whither away so early, little red-cap?”
“To my grandmother’s.”
“What have you got in your apron?”
“Cake and wine. Yesterday was baking-day, so poor sick grandmother is to have something good, to make her stronger.”…
In terms of preprocessing, we will initially make all the text lowercase and break the text into character n-grams, where n=2. Consider the following sentence:
The king was hunting in the forest.
This would break down to a sequence of n-grams, as follows:
['th', 'e ', 'ki', 'ng', ' w', 'as', …]
We use character-level bigrams because doing so greatly reduces the size of the vocabulary compared with using individual words. Moreover, we will replace every bigram that appears fewer than 10 times in the corpus with a special token (UNK), indicating that the bigram is unknown. This reduces the size of the vocabulary even further.
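The preprocessing just described can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions, not the book's actual preprocessing code: the helper names to_bigrams, build_vocabulary, and encode are hypothetical, and the bigrams are assumed to be non-overlapping, as in the example above. A min_count of 2 is used in the demonstration only because the sample sentence is so short; the article uses 10.

```python
from collections import Counter

def to_bigrams(text):
    """Lowercase the text and split it into non-overlapping character bigrams."""
    text = text.lower()
    return [text[i:i + 2] for i in range(0, len(text), 2)]

def build_vocabulary(bigrams, min_count=10, unk_token='UNK'):
    """Assign an integer ID to every bigram seen at least min_count times.

    ID 0 is reserved for the UNK token, which stands in for all rare bigrams.
    """
    counts = Counter(bigrams)
    dictionary = {unk_token: 0}
    for bigram, count in counts.most_common():
        if count >= min_count:
            dictionary[bigram] = len(dictionary)
    return dictionary

def encode(bigrams, dictionary, unk_token='UNK'):
    """Convert bigrams to IDs, mapping out-of-vocabulary bigrams to UNK."""
    return [dictionary.get(bg, dictionary[unk_token]) for bg in bigrams]

bigrams = to_bigrams('The king was hunting in the forest.')
dictionary = build_vocabulary(bigrams, min_count=2)  # tiny threshold for the demo
encoded = encode(bigrams, dictionary)
```

In this toy run only 'th', 'e ', and 'ng' occur twice, so every other bigram is mapped to the UNK ID.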
Implementing an LSTM
Though there are sublibraries in TensorFlow that have already implemented LSTMs ready to go, we will implement one from scratch. This will be very valuable, as in the real world there might be situations where you cannot use these off-the-shelf components directly.
We will first discuss the hyperparameters used for the LSTM and their effects. Thereafter, we will discuss the parameters (weights and biases) required to implement the LSTM. We will then discuss how these parameters are used to write the operations taking place within the LSTM. This will be followed by understanding how we will sequentially feed data to the LSTM. Next, we will discuss how we can implement the optimization of the parameters using gradient clipping. Finally, we will investigate how we can use the learned model to output predictions, which are essentially bigrams that will eventually add up to a meaningful story.
We will define some hyper-parameters required for the LSTM:
# Number of neurons in the hidden state variables
num_nodes = 128
# Number of data points in a batch we process
batch_size = 64
# Number of time steps we unroll for during optimization
num_unrollings = 50
dropout = 0.2  # We use dropout
The following list describes each of the hyperparameters:
- num_nodes: This denotes the number of neurons in the cell memory state. When data is abundant, increasing the complexity of the cell memory can give better performance; however, it also slows down the computations.
- batch_size: This is the number of data points processed in a single step. Increasing the batch size can give better performance, but requires more memory.
- num_unrollings: This is the number of time steps used in truncated backpropagation through time (truncated BPTT). The higher num_unrollings is, the better the performance, but both the memory requirement and the computational time increase.
- dropout: Finally, we will employ dropout (a regularization technique) to reduce overfitting of the model and produce better results; dropout randomly drops information from inputs/outputs/state variables before passing them to their successive operations. This encourages redundant features during learning, leading to better performance.
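To make the dropout hyperparameter concrete, here is a minimal NumPy sketch of "inverted" dropout, where the surviving values are scaled up by 1/keep_prob so that the expected value is unchanged. The function name dropout_np is hypothetical, and the sketch only illustrates the idea; it is not the TensorFlow implementation.

```python
import numpy as np

def dropout_np(x, rate, rng):
    """Zero each element with probability `rate`; rescale the rest by 1/keep_prob."""
    keep_prob = 1.0 - rate
    mask = rng.random(x.shape) < keep_prob  # True where the value is kept
    return np.where(mask, x / keep_prob, 0.0)

rng = np.random.default_rng(0)
hidden = np.ones((1000, 128))            # stand-in for a batch of hidden states
dropped = dropout_np(hidden, rate=0.2, rng=rng)
```

With a rate of 0.2, roughly 20% of the entries become zero and the rest become 1.25, so the mean of the array stays close to 1.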
Now we will define TensorFlow variables for the actual parameters of the LSTM.
First, we will define the input gate parameters:
- ix: These are weights connecting the input to the input gate
- im: These are weights connecting the hidden state to the input gate
- ib: This is the bias
Here we will define the parameters:
# Input gate (it) - How much memory to write to cell state
# Connects the current input to the input gate
ix = tf.Variable(tf.truncated_normal([vocabulary_size, num_nodes], stddev=0.02))
# Connects the previous hidden state to the input gate
im = tf.Variable(tf.truncated_normal([num_nodes, num_nodes], stddev=0.02))
# Bias of the input gate
ib = tf.Variable(tf.random_uniform([1, num_nodes], -0.02, 0.02))
Similarly, we will define such weights for the forget gate, candidate value (used for memory cell computations), and output gate.
The forget gate is defined as follows:
# Forget gate (ft) - How much memory to discard from cell state
# Connects the current input to the forget gate
fx = tf.Variable(tf.truncated_normal([vocabulary_size, num_nodes], stddev=0.02))
# Connects the previous hidden state to the forget gate
fm = tf.Variable(tf.truncated_normal([num_nodes, num_nodes], stddev=0.02))
# Bias of the forget gate
fb = tf.Variable(tf.random_uniform([1, num_nodes], -0.02, 0.02))
The candidate value (used to compute the cell state) is defined as follows:
# Candidate value (c~t) - Used to compute the current cell state
# Connects the current input to the candidate
cx = tf.Variable(tf.truncated_normal([vocabulary_size, num_nodes], stddev=0.02))
# Connects the previous hidden state to the candidate
cm = tf.Variable(tf.truncated_normal([num_nodes, num_nodes], stddev=0.02))
# Bias of the candidate
cb = tf.Variable(tf.random_uniform([1, num_nodes], -0.02, 0.02))
The output gate is defined as follows:
# Output gate - How much memory to output from the cell state
# Connects the current input to the output gate
ox = tf.Variable(tf.truncated_normal([vocabulary_size, num_nodes], stddev=0.02))
# Connects the previous hidden state to the output gate
om = tf.Variable(tf.truncated_normal([num_nodes, num_nodes], stddev=0.02))
# Bias of the output gate
ob = tf.Variable(tf.random_uniform([1, num_nodes], -0.02, 0.02))
Next, we will define variables for the state and output. These are the TensorFlow variables representing the internal cell state and the external hidden state of the LSTM cell. When defining the LSTM computational operation, we define these to be updated with the latest cell state and hidden state values we compute, using the tf.control_dependencies(…) function.
# Variables saving state across unrollings.
# Hidden state
saved_output = tf.Variable(tf.zeros([batch_size, num_nodes]),
                           trainable=False, name='train_hidden')
# Cell state
saved_state = tf.Variable(tf.zeros([batch_size, num_nodes]),
                          trainable=False, name='train_cell')

# Same variables for validation phase
saved_valid_output = tf.Variable(tf.zeros([1, num_nodes]),
                                 trainable=False, name='valid_hidden')
saved_valid_state = tf.Variable(tf.zeros([1, num_nodes]),
                                trainable=False, name='valid_cell')
Finally, we will define a softmax layer to get the actual predictions out:
# Softmax Classifier weights and biases.
w = tf.Variable(tf.truncated_normal([num_nodes, vocabulary_size], stddev=0.02))
b = tf.Variable(tf.random_uniform([vocabulary_size], -0.02, 0.02))
Note that we're initializing the weights from a truncated normal distribution with zero mean and a small standard deviation (and the biases from a small uniform range). This is fine as our model is a simple single LSTM cell. However, when the network gets deeper (that is, multiple LSTM cells stacked on top of each other), more careful initialization techniques are required. One such initialization technique is known as Xavier initialization, proposed by Glorot and Bengio in their paper Understanding the difficulty of training deep feedforward neural networks, Proceedings of the 13th International Conference on Artificial Intelligence and Statistics, 2010.
This is available as a variable initializer in TensorFlow, as shown here: https://www.tensorflow.org/api_docs/python/tf/contrib/layers/xavier_initializer.
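For intuition, the uniform variant of Xavier initialization draws weights from the range [-limit, limit], where limit = sqrt(6 / (fan_in + fan_out)). The following NumPy sketch illustrates that formula; the function name xavier_uniform and the example shape are assumptions made for illustration, and this is not the TensorFlow initializer itself.

```python
import numpy as np

def xavier_uniform(fan_in, fan_out, rng):
    """Sample a [fan_in, fan_out] weight matrix with Glorot/Xavier uniform scaling."""
    limit = np.sqrt(6.0 / (fan_in + fan_out))
    return rng.uniform(-limit, limit, size=(fan_in, fan_out))

rng = np.random.default_rng(0)
# Hypothetical sizes: 100 input units feeding 128 hidden units
weights = xavier_uniform(100, 128, rng)
limit = np.sqrt(6.0 / (100 + 128))
```

The resulting weights have a variance of about 2 / (fan_in + fan_out), which keeps the scale of activations and gradients roughly constant from layer to layer.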
Defining an LSTM cell and its operations
With the weights and the bias defined, we can now define the operations within an LSTM cell. These operations include the following:
- Calculating the outputs produced by the input and forget gates
- Calculating the internal cell state
- Calculating the output produced by the output gate
- Calculating the external hidden state
The following is the implementation of our LSTM cell:
def lstm_cell(i, o, state):
    input_gate = tf.sigmoid(tf.matmul(i, ix) + tf.matmul(o, im) + ib)
    forget_gate = tf.sigmoid(tf.matmul(i, fx) + tf.matmul(o, fm) + fb)
    update = tf.matmul(i, cx) + tf.matmul(o, cm) + cb
    state = forget_gate * state + input_gate * tf.tanh(update)
    output_gate = tf.sigmoid(tf.matmul(i, ox) + tf.matmul(o, om) + ob)
    return output_gate * tf.tanh(state), state
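To check the shapes involved without building a TensorFlow graph, the same computation can be written in NumPy. This is a sketch that mirrors the cell above under assumed toy sizes (a vocabulary of 5, 4 hidden units, a batch of 3); the function lstm_cell_np, the helper init_params, and the params dictionary are hypothetical names introduced only for this illustration.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def init_params(vocabulary_size, num_nodes, rng):
    """Create small random weights using the same names as the article."""
    params = {}
    for gate in 'ifco':
        params[gate + 'x'] = rng.normal(0.0, 0.02, (vocabulary_size, num_nodes))
        params[gate + 'm'] = rng.normal(0.0, 0.02, (num_nodes, num_nodes))
        params[gate + 'b'] = rng.uniform(-0.02, 0.02, (1, num_nodes))
    return params

def lstm_cell_np(x, h_prev, c_prev, p):
    """One LSTM step: returns the new hidden state and the new cell state."""
    input_gate = sigmoid(x @ p['ix'] + h_prev @ p['im'] + p['ib'])
    forget_gate = sigmoid(x @ p['fx'] + h_prev @ p['fm'] + p['fb'])
    candidate = np.tanh(x @ p['cx'] + h_prev @ p['cm'] + p['cb'])
    c = forget_gate * c_prev + input_gate * candidate
    output_gate = sigmoid(x @ p['ox'] + h_prev @ p['om'] + p['ob'])
    h = output_gate * np.tanh(c)
    return h, c

rng = np.random.default_rng(0)
vocabulary_size, num_nodes, batch_size = 5, 4, 3
params = init_params(vocabulary_size, num_nodes, rng)
x = np.eye(vocabulary_size)[[0, 2, 4]]       # three one-hot inputs
h = np.zeros((batch_size, num_nodes))
c = np.zeros((batch_size, num_nodes))
h, c = lstm_cell_np(x, h, c, params)
```

Both returned arrays have the shape [batch_size, num_nodes], and every entry of the hidden state lies strictly between -1 and 1 because it is a sigmoid value multiplied by a tanh value.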
Defining inputs and labels
Now we will define training inputs (unrolled) and labels. The training inputs are a list of num_unrollings sequential batches of data, where each batch of data is of the [batch_size, vocabulary_size] size:
train_inputs, train_labels = [], []
for ui in range(num_unrollings):
    train_inputs.append(tf.placeholder(tf.float32,
        shape=[batch_size, vocabulary_size], name='train_inputs_%d' % ui))
    train_labels.append(tf.placeholder(tf.float32,
        shape=[batch_size, vocabulary_size], name='train_labels_%d' % ui))
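How these placeholders are fed is not shown in this article. As a rough sketch, assuming the corpus has already been converted to a list of integer bigram IDs, the unrolled one-hot batches could be produced as follows. The helper names one_hot and unrolled_batches, and the way the batch cursors are spaced, are assumptions made for illustration rather than the book's data generator.

```python
import numpy as np

def one_hot(ids, vocabulary_size):
    """Turn a list of integer IDs into a [len(ids), vocabulary_size] one-hot array."""
    out = np.zeros((len(ids), vocabulary_size), dtype=np.float32)
    out[np.arange(len(ids)), ids] = 1.0
    return out

def unrolled_batches(token_ids, batch_size, num_unrollings, vocabulary_size):
    """Build num_unrollings (input, label) batches; each label is the next bigram."""
    segment = len(token_ids) // batch_size
    cursors = [b * segment for b in range(batch_size)]  # evenly spaced start points
    inputs, labels = [], []
    for step in range(num_unrollings):
        in_ids = [token_ids[(c + step) % len(token_ids)] for c in cursors]
        lbl_ids = [token_ids[(c + step + 1) % len(token_ids)] for c in cursors]
        inputs.append(one_hot(in_ids, vocabulary_size))
        labels.append(one_hot(lbl_ids, vocabulary_size))
    return inputs, labels

token_ids = [i % 7 for i in range(40)]  # toy corpus of bigram IDs
inputs, labels = unrolled_batches(token_ids, batch_size=4,
                                  num_unrollings=5, vocabulary_size=7)
```

Each element of inputs and labels could then be paired with the corresponding placeholder in a feed dictionary.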
We also define placeholders for validation inputs and outputs, which will be used to compute the validation perplexity. Note that we do not use unrolling for validation-related computations.
# Validation data placeholders
valid_inputs = tf.placeholder(tf.float32,
    shape=[1, vocabulary_size], name='valid_inputs')
valid_labels = tf.placeholder(tf.float32,
    shape=[1, vocabulary_size], name='valid_labels')
Defining sequential calculations required to process sequential data
Here we will calculate the outputs produced by a single unrolling of the training inputs in a recursive manner. We will also use dropout (refer to Dropout: A Simple Way to Prevent Neural Networks from Overfitting, Srivastava, Nitish, and others, Journal of Machine Learning Research 15 (2014): 1929-1958), as this gives a slightly better performance. Finally we compute the logit values for all the hidden output values computed for the training data:
# Keeps the calculated state outputs in all the unrollings
# Used to calculate loss
outputs = list()

# These two python variables are iteratively updated
# at each step of unrolling
output = saved_output
state = saved_state

# Compute the hidden state (output) and cell state (state)
# recursively for all the steps in unrolling
for i in train_inputs:
    output, state = lstm_cell(i, output, state)
    output = tf.nn.dropout(output, keep_prob=1.0 - dropout)
    # Append each computed output value
    outputs.append(output)

# calculate the score values
logits = tf.matmul(tf.concat(axis=0, values=outputs), w) + b
Next, before calculating the loss, we have to make sure that the variables holding the hidden state and the cell state (saved_output and saved_state) are updated with the most recent values we calculated earlier. This is achieved by adding a tf.control_dependencies condition and keeping the loss calculation within the condition:
with tf.control_dependencies([saved_output.assign(output),
                              saved_state.assign(state)]):
    # Classifier.
    loss = tf.reduce_mean(
        tf.nn.softmax_cross_entropy_with_logits_v2(
            logits=logits, labels=tf.concat(axis=0, values=train_labels)))
We also define the forward propagation logic for validation data. Note that we do not use dropout during validation, but only during training:
# Validation phase related inference logic
# Compute the LSTM cell output for validation data
valid_output, valid_state = lstm_cell(
    valid_inputs, saved_valid_output, saved_valid_state)
# Compute the logits
valid_logits = tf.nn.xw_plus_b(valid_output, w, b)
Defining the optimizer
Here we will define the optimization process. We will use Adam, a widely used stochastic gradient-based optimizer that adapts the step size for each parameter. Here in the code, gstep is a variable that is used to decay the learning rate over time. We will discuss the details in the next section. Furthermore, we will use gradient clipping to avoid exploding gradients:
# Decays learning rate everytime the gstep increases
# (gstep is the global step variable defined in the next section)
tf_learning_rate = tf.train.exponential_decay(
    0.001, gstep, decay_steps=1, decay_rate=0.5)

# Adam Optimizer. And gradient clipping.
optimizer = tf.train.AdamOptimizer(tf_learning_rate)
gradients, v = zip(*optimizer.compute_gradients(loss))
gradients, _ = tf.clip_by_global_norm(gradients, 5.0)
optimizer = optimizer.apply_gradients(zip(gradients, v))
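Clipping by global norm rescales all gradients by the same factor whenever their combined L2 norm exceeds the threshold, so the direction of the update is preserved. The following NumPy sketch illustrates the computation; the function name clip_by_global_norm_np and the toy gradients are assumptions made for illustration.

```python
import numpy as np

def clip_by_global_norm_np(grads, clip_norm):
    """Scale a list of gradient arrays so that their joint L2 norm is at most clip_norm."""
    global_norm = np.sqrt(sum(np.sum(g ** 2) for g in grads))
    # If global_norm <= clip_norm the scale is 1 and the gradients are unchanged
    scale = clip_norm / max(global_norm, clip_norm)
    return [g * scale for g in grads], global_norm

# Toy gradients whose global norm is sqrt(3^2 + 4^2 + 12^2) = 13
grads = [np.array([3.0, 4.0]), np.array([12.0])]
clipped, norm_before = clip_by_global_norm_np(grads, clip_norm=5.0)
norm_after = np.sqrt(sum(np.sum(g ** 2) for g in clipped))
```

Here the gradients are scaled by 5/13, so their global norm drops from 13 to exactly the threshold of 5.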
Decaying learning rate over time
As mentioned earlier, we use a decaying learning rate instead of a constant learning rate. Decaying the learning rate over time is a common technique used in deep learning for achieving better performance and reducing overfitting. The key idea here is to step down the learning rate (for example, by a factor of 0.5) if the validation perplexity does not decrease for a predefined number of epochs. Let's see how exactly this is implemented, in more detail:
First we define gstep and an operation to increment gstep, called inc_gstep as follows:
# learning rate decay
gstep = tf.Variable(0, trainable=False, name='global_step')
# Running this operation will cause the value of gstep
# to increase, while in turn reducing the learning rate
inc_gstep = tf.assign(gstep, gstep + 1)
With this defined, we can write some simple logic to call the inc_gstep operation whenever validation loss does not decrease, as follows:
# Learning rate decay related
# If valid perplexity does not decrease
# continuously for this many epochs
# decrease the learning rate
decay_threshold = 5
# Keep counting perplexity increases
decay_count = 0
min_perplexity = 1e10

# Learning rate decay logic
def decay_learning_rate(session, v_perplexity):
    global decay_threshold, decay_count, min_perplexity
    # Decay learning rate
    if v_perplexity < min_perplexity:
        # New minimum validation perplexity: record it and reset the counter
        decay_count = 0
        min_perplexity = v_perplexity
    else:
        decay_count += 1

    if decay_count >= decay_threshold:
        print('\t Reducing learning rate')
        decay_count = 0
        session.run(inc_gstep)
Here we update min_perplexity whenever we experience a new minimum validation perplexity. Also, v_perplexity is the current validation perplexity.
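With decay_steps=1 and decay_rate=0.5, exponential decay computes initial_rate * decay_rate ** (gstep / decay_steps), so every increment of gstep halves the learning rate. The short sketch below works through that arithmetic in plain Python; the function name decayed_learning_rate is hypothetical.

```python
def decayed_learning_rate(initial_rate, gstep, decay_steps=1, decay_rate=0.5):
    """Exponential decay: the rate is multiplied by decay_rate every decay_steps steps."""
    return initial_rate * decay_rate ** (gstep / decay_steps)

# Learning rate after 0, 1, 2 and 3 increments of gstep
rates = [decayed_learning_rate(0.001, g) for g in range(4)]
```

Starting from 0.001, the learning rate becomes 0.0005, 0.00025, and 0.000125 after the first three reductions.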
Now we can make predictions simply by applying a softmax activation to the logits we calculated previously. We also define a prediction operation for the validation logits:
train_prediction = tf.nn.softmax(logits)

# Make sure that the state variables are updated
# before moving on to the next iteration of generation
with tf.control_dependencies([saved_valid_output.assign(valid_output),
                              saved_valid_state.assign(valid_state)]):
    valid_prediction = tf.nn.softmax(valid_logits)
Calculating perplexity (loss)
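Perplexity is a measure of how surprised the LSTM is to see the next n-gram, given the current n-gram. Therefore, a higher perplexity means poorer performance, whereas a lower perplexity means better performance. The following operations compute the average cross-entropy, that is, the perplexity before exponentiation (hence the _without_exp suffix); taking the exponential of this value gives the perplexity: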
train_perplexity_without_exp = tf.reduce_sum(
    tf.concat(train_labels, 0) * -tf.log(tf.concat(train_prediction, 0) + 1e-10)
) / (num_unrollings * batch_size)

# Compute validation perplexity
valid_perplexity_without_exp = tf.reduce_sum(
    valid_labels * -tf.log(valid_prediction + 1e-10))
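As a sanity check on the definition, perplexity is the exponential of the average cross-entropy, so a model that predicts a uniform distribution over V bigrams has a perplexity of about V, and a model that is always right has a perplexity of about 1. The NumPy sketch below illustrates this; the function name perplexity and the toy data are assumptions made for illustration.

```python
import numpy as np

def perplexity(labels, predictions, eps=1e-10):
    """exp of the mean cross-entropy between one-hot labels and predicted distributions."""
    cross_entropy = -np.sum(labels * np.log(predictions + eps)) / labels.shape[0]
    return np.exp(cross_entropy)

vocabulary_size = 4
labels = np.eye(vocabulary_size)[[0, 1, 2, 3, 0]]        # five one-hot targets
uniform = np.full(labels.shape, 1.0 / vocabulary_size)   # a model that has learned nothing
uniform_perplexity = perplexity(labels, uniform)
perfect_perplexity = perplexity(labels, labels)
```

The uniform model scores approximately 4 (the vocabulary size), while the perfect model scores approximately 1.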
We employ state resetting, as we are processing multiple documents. So, at the beginning of processing a new document, we reset the hidden state back to zero. However, it is not very clear whether resetting the state helps in practice. On one hand, it sounds intuitive to reset the memory of the LSTM cell to zero when starting to read a new story. On the other hand, this creates a bias in state variables toward zero. We encourage you to try running the algorithm both with and without state resetting and see which method performs better.
# Reset train state
reset_train_state = tf.group(
    tf.assign(saved_state, tf.zeros([batch_size, num_nodes])),
    tf.assign(saved_output, tf.zeros([batch_size, num_nodes])))

# Reset valid state
reset_valid_state = tf.group(
    tf.assign(saved_valid_state, tf.zeros([1, num_nodes])),
    tf.assign(saved_valid_output, tf.zeros([1, num_nodes])))
Greedy sampling to break unimodality
This is quite a simple technique where we can stochastically sample the next prediction out of the n best candidates found by the LSTM. Furthermore, we will give the probability of picking one candidate to be proportional to the likelihood of that candidate being the next bigram:
def sample(distribution):
    best_inds = np.argsort(distribution)[-3:]
    best_probs = distribution[best_inds] / np.sum(distribution[best_inds])
    best_idx = np.random.choice(best_inds, p=best_probs)
    return best_idx
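To see the effect of this sampling scheme, the following sketch generalizes the function to an arbitrary number of candidates and draws many samples from a toy distribution. The function name sample_top_n, the n parameter, and the example probabilities are assumptions introduced for illustration.

```python
import numpy as np

def sample_top_n(distribution, n=3, rng=None):
    """Sample one index from the n most likely entries, in proportion to their probability."""
    rng = np.random.default_rng() if rng is None else rng
    best_inds = np.argsort(distribution)[-n:]
    best_probs = distribution[best_inds] / np.sum(distribution[best_inds])
    return rng.choice(best_inds, p=best_probs)

rng = np.random.default_rng(0)
distribution = np.array([0.05, 0.4, 0.1, 0.3, 0.15])
draws = [int(sample_top_n(distribution, n=3, rng=rng)) for _ in range(1000)]
```

Only indices 1, 3, and 4 (the three most likely candidates) are ever returned, and index 1 is drawn most often, so the output varies between runs without drifting to unlikely bigrams.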
Generating new text
Finally, we will define the placeholders, variables, and operations required for generating new text. These are defined similarly to what we did for the training data. First, we will define an input placeholder and variables for state and output. Next, we will define state resetting operations. Finally, we will define the LSTM cell calculations and predictions for the new text to be generated:
# Text generation: batch 1, no unrolling.
test_input = tf.placeholder(tf.float32,
    shape=[1, vocabulary_size], name='test_input')

# Same variables for testing phase
saved_test_output = tf.Variable(tf.zeros([1, num_nodes]),
                                trainable=False, name='test_hidden')
saved_test_state = tf.Variable(tf.zeros([1, num_nodes]),
                               trainable=False, name='test_cell')

# Compute the LSTM cell output for testing data
test_output, test_state = lstm_cell(
    test_input, saved_test_output, saved_test_state)

# Make sure that the state variables are updated
# before moving on to the next iteration of generation
with tf.control_dependencies([saved_test_output.assign(test_output),
                              saved_test_state.assign(test_state)]):
    test_prediction = tf.nn.softmax(tf.nn.xw_plus_b(test_output, w, b))

# Reset test state
reset_test_state = tf.group(
    saved_test_output.assign(tf.random_normal([1, num_nodes], stddev=0.05)),
    saved_test_state.assign(tf.random_normal([1, num_nodes], stddev=0.05)))
Example generated text
Let’s take a look at some of the data generated by the LSTM after 50 steps of learning:
they saw that the birds were at her bread, and threw behind him a comb which made a great ridge with a thousand times thousands of spikes. that was a collier. the nixie was at church, and thousands of spikes, they were flowers, however, and had hewn through the glass, the children had formed a hill of mirrors, and was so slippery that it was impossible for the nixie to cross it. then she thought, i will go home quickly and fetch my axe, and cut the hill of glass in half. long before she returned, however, and had hewn through the glass, the children saw her from afar, and he sat down close to it, and was so slippery that it was impossible for the nixie to cross it.
As you can see from the output, the LSTM has essentially reproduced a story about a water-nixie from our training corpus. However, it does not merely reproduce the text; it adds more color to the story by introducing new elements, such as a church and flowers, which were not found in the original text.