
Modern Natural Language Processing – Part 2


In this series I am going to first introduce the basics of data munging: converting raw data into a processed form amenable to machine learning tasks. Then, I will cover the basics of prepping the data for a learning algorithm, including constructing a customized embedding matrix from current state-of-the-art embeddings (and if you don’t know what embeddings are, I will cover that too). I will be going over a useful way of structuring the various components (data manager, training model, driver, and utilities) that simultaneously allows for fast implementation and flexibility for future modifications to the experiment. And finally, I will cover an instance of a training model, showing how it connects to the infrastructure outlined here, how it is trained on the data and evaluated for performance, and how it is used for tasks like sampling sentences.

Here in Part 2, we cover Igor, embeddings, serving data, and different sized sentences and masking.

Prep and Data Servers

Given the earlier implementations (see Part 1), the data is in a much more amenable format. However, it now needs to be loaded, prepped, and poised for use.

Igor

The manager for our data and parameters is nicknamed Igor, after Frankenstein’s assistant. I will get into many of Igor’s functions in the next blog post. For now, it is vital to know that Igor stores the parameters in its __dict__, which allows them to be referenced using dot notation.

## igor.py
import yaml

class Igor(object):
    def __init__(self, config):
        self.__dict__.update(config)

    @classmethod
    def from_file(cls, yaml_file):
        with open(yaml_file) as fp:
            return cls(yaml.load(fp))
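
As a quick illustration of the dot-notation access, here is a minimal usage sketch; the file name config.yaml and the batch_size key are hypothetical stand-ins for whatever your experiment config contains.

## hypothetical usage sketch
igor = Igor.from_file("config.yaml")   # config.yaml is a stand-in for your experiment config
print(igor.batch_size)                 # any key in the YAML file becomes a dot-accessible attribute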

Embeddings

Now that we have our data in integer format, let’s prep the rest of the experiment.

A vital component in many modern-day NLP systems is what has been called the ‘sriracha’ of NLP: word embeddings. What exactly are they, though? They are individual vectors mapped to tokens (like our integers) that were trained to optimize learning objectives that encouraged similar words to have similar vectors. The reason they are so useful is that they give the model a head start—it can immediately start associating overlapping signals from similar words in different sentences.
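
Concretely, an embedding matrix is just a big lookup table keyed by those integers. Here is a minimal sketch with a toy vocabulary; the sizes and random values are purely illustrative.

## toy embedding lookup (sizes and values are illustrative only)
import numpy as np

vocab_size, embedding_size = 5, 4
embedding_matrix = np.random.randn(vocab_size, embedding_size)

token_ids = np.array([3, 1, 4])                  # an integer-coded "sentence"
sentence_vectors = embedding_matrix[token_ids]   # shape (3, 4): one vector per token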

We’re going to work with GloVe embeddings. You can obtain all of them for free from the Stanford website. The following code assumes an Igor instance loaded with a few vital parameters:

#### embedding.conf
embedding_size: 300
target_glove: /path/to/glove/glove.840B.300d.txt
vocab_file: /path/to/vocab/words.vocab
save_dir: /path/to/savedir/
data_file: data.pkl

It also assumes that the 300-dimensional vectors trained on the 840-billion-token Common Crawl corpus are used.

There are smaller ones if they are more appropriate to your task. We will only be using a subset of the vectors.

And then you can use a function like the following to compute an embedding matrix; in the next blog post I will cover how to use it. Note that tqdm, a very handy progress bar, is used here, but it doesn’t have to be. Also note that I use the Keras implementation of the Glorot uniform initializer for words that aren’t in the embedding data.

### utils.py
from os import makedirs, path

import numpy as np
from tqdm import tqdm
from keras.initializations import glorot_uniform  # keras.initializers in Keras 2+

def embeddings_from_vocab(igor, vocab):
    print("using vocab and glove file to generate embedding matrix")
    remaining_vocab = set(vocab.keys())
    embeddings = np.zeros((len(vocab), igor.embedding_size))
    print("{} words to convert".format(len(remaining_vocab)))

    if igor.save_dir[-1] != "/":
        igor.save_dir += "/"
    if not path.exists(igor.save_dir):
        makedirs(igor.save_dir)

    fileiter = open(igor.target_glove).readlines()

    for line in tqdm(fileiter):
        line = line.replace("\n", "").split(" ")
        try:
            word, nums = line[0], [float(x.strip()) for x in line[1:]]
            if word in remaining_vocab:
                embeddings[vocab[word]] = np.array(nums)
                remaining_vocab.remove(word)
        except Exception as e:
            print("{} broke. exception: {}. line: {}.".format(word, e, line))

    print("{} words were not in glove; saving to oov.txt".format(len(remaining_vocab)))
    with open(path.join(igor.save_dir, "oov.txt"), "w") as fp:
        fp.write("\n".join(remaining_vocab))

    # out-of-vocabulary words get a randomly initialized vector
    for word in tqdm(remaining_vocab):
        embeddings[vocab[word]] = np.asarray(glorot_uniform((igor.embedding_size,)).eval())

    with open(path.join(igor.save_dir, "embedding.npy"), "wb") as fp:
        np.save(fp, embeddings)
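
To give a sense of how this fits together, here is a hypothetical usage sketch; it assumes the embedding.conf above, the Igor class, and the Vocabulary class from Part 1’s utils.

## hypothetical usage of embeddings_from_vocab
igor = Igor.from_file("embedding.conf")
vocab = Vocabulary.load(igor.vocab_file)   # Vocabulary comes from Part 1
embeddings_from_vocab(igor, vocab)

embeddings = np.load(path.join(igor.save_dir, "embedding.npy"))
assert embeddings.shape == (len(vocab), igor.embedding_size)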

Serving Data

Igor’s main task of serving data is broken down into two key functions: _serve_single and _serve_batch. The class below is more fleshed out than the last Igor class; this time, it includes these two functions as well as some others.

There are several main things to notice in the implementation below:

1. Each sentence is placed into a zero-matrix that is potentially larger than it. This is essential to what is known as masking (more on this below).

2. The sentences are offset by one and the target data is the next word.

3. The data is being served in batches, which maximizes the efficiency of the GPU.

4. The target variable, out_Y, is formatted with the to_categorical function, which encodes an integer as a one-hot vector: a vector with zeros at every position except one. It is going to be used here with a cross-entropy loss, which computes the dot product between the (log of the) output probability vector (the same size as out_Y) and out_Y. In effect, this is the same as selecting a single element from the output probability vector (a toy version of this is sketched right after this list).
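
To make points 2 and 4 concrete, here is a small sketch of how a single integer-coded sentence turns into an input/target pair and a one-hot target; the sentence, sequence_length, and vocab_size values are toy numbers.

## toy illustration of the offset targets and to_categorical (values are illustrative)
import numpy as np
from keras.utils.np_utils import to_categorical

sentence = [2, 7, 5, 3]                 # an integer-coded sentence
sequence_length, vocab_size = 6, 10

in_X = np.zeros(sequence_length, dtype=np.int32)
out_Y = np.zeros(sequence_length, dtype=np.int32)
for j, (w_in, w_out) in enumerate(zip(sentence[:-1], sentence[1:])):
    in_X[j] = w_in                      # input is the current word
    out_Y[j] = w_out                    # target is the next word

one_hot_Y = to_categorical(out_Y, vocab_size)   # shape (6, 10): one row per position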

### igor.py
from keras.utils.data_utils import get_file
from keras.utils.np_utils import to_categorical
import yaml
import itertools
import numpy as np
try:
    import cPickle as pickle
except:
    import pickle
from utils import Vocabulary



class Igor(object):
    def __init__(self, config):
        self.__dict__.update(config)

    @classmethod
    def from_file(cls, yaml_file):
        with open(yaml_file) as fp:
            return cls(yaml.load(fp))

    @property
    def num_train_batches(self):
        return len(self.train_data) // self.batch_size

    @property
    def num_dev_batches(self):
        return len(self.dev_data) // self.batch_size

    @property
    def num_test_batches(self):
        return len(self.test_data) // self.batch_size

    @property
    def num_train_samples(self):
        return self.num_train_batches * self.batch_size

    @property
    def num_dev_samples(self):
        return self.num_dev_batches * self.batch_size

    @property
    def num_test_samples(self):
        return self.num_test_batches * self.batch_size

    def _serve_single(self, data): 
        for data_i in np.random.choice(len(data), len(data), replace=False):
            in_X = np.zeros(self.sequence_length)
            out_Y = np.zeros(self.sequence_length, dtype=np.int32)
            bigram_data = zip(data[data_i][0:-1], data[data_i][1:])
            for datum_j,(datum_in, datum_out) in enumerate(bigram_data):
                in_X[datum_j] = datum_in
                out_Y[datum_j] = datum_out
            yield in_X, out_Y

    def _serve_batch(self, data):
        dataiter = self._serve_single(data)
        V = self.vocab_size
        S = self.sequence_length
        B = self.batch_size

        while dataiter:
            in_X = np.zeros((B, S), dtype=np.int32)
            out_Y = np.zeros((B, S, V), dtype=np.int32)
            next_batch = list(itertools.islice(dataiter, 0, self.batch_size))
            if len(next_batch) < self.batch_size:
                return  # an incomplete final batch is dropped (ends the generator)
            for d_i, (d_X, d_Y) in enumerate(next_batch):
                in_X[d_i] = d_X
                out_Y[d_i] = to_categorical(d_Y, V)
               
            yield in_X, out_Y

    def _data_gen(self, data, forever=True):
        ### extra boolean here so that it can go once through while loop
        working = True
        while working:
            for batch in self._serve_batch(data):
                yield batch
            working = working and forever
       
    def dev_gen(self, forever=True):
        return self._data_gen(self.dev_data, forever)

    def train_gen(self, forever=True):
        return self._data_gen(self.train_data, forever)

    def test_gen(self):
        return self._data_gen(self.test_data, False)

    def prep(self):
        ## this assumes the converted integer data has been placed into a pickle
        with open(self.data_file, "rb") as fp:
            self.train_data, self.dev_data, self.test_data = pickle.load(fp)

        if self.embeddings_file:
            self.saved_embeddings = np.load(self.embeddings_file)
        else:
            self.saved_embeddings = None

        self.vocab = Vocabulary.load(self.vocab_file)
        self.vocab_size = len(self.vocab)

        self.sequence_length = max(map(len, self.train_data + self.dev_data + self.test_data))
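
With this class in place, a driver can pull fixed-size batches from the generators. Here is a minimal, hypothetical sketch; lm.conf is a stand-in for a config file containing the parameters listed at the end of this post.

## hypothetical driver snippet
igor = Igor.from_file("lm.conf")
igor.prep()

for in_X, out_Y in igor.train_gen(forever=False):
    print(in_X.shape)    # (batch_size, sequence_length)
    print(out_Y.shape)   # (batch_size, sequence_length, vocab_size)
    break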

Different Sized Sentences and Masking

There is one last piece of set-up information. In order to handle different-sized sentences, you need to use a mask. What exactly is a mask, though? Well, since we are loading our data into a matrix that has the same size on each dimension, we have to account for the positions left over by shorter sentences.

For this task, we will use a specific numeric value at the positions where there is no data. Keras recognizes this value internally as marking positions it should mask. More specifically, since we are using the Embedding layer, it will check where the input data equals this masked value and push a binary mask forward through your constructed model so that it gets used in the correct spots. Note: there are a few types of layers that Keras can’t push the mask through (without some clever finagling), but for this model it will work. I will discuss how the mask gets used in the next post; for now, just know that the token at index 0 in our Vocabulary and the zeros in the data matrix correspond to masked positions.
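
As a preview of the piece of the model responsible for this, here is a minimal sketch of a Keras Embedding layer with masking turned on; the full model is covered in the next post, and passing the saved matrix via the weights argument is just one way to initialize it.

## sketch of a masked embedding layer
from keras.layers import Embedding

embedding_layer = Embedding(igor.vocab_size,
                            igor.embedding_size,
                            weights=[igor.saved_embeddings],   # the matrix computed above
                            mask_zero=True,                    # treat index 0 (our padding) as masked
                            input_length=igor.sequence_length)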

Conclusion

And that’s it! The data is now ready to be loaded up and served.

An end-of-post note:

Most of the prep code should be placed into a single preprocessing script. It’s sometimes easiest to just add it to the bottom of the utils file.

#### at the bottom of utils.py
if __name__ == "__main__":
    import pickle
    from igor import Igor

    print("getting data")
    raw_data = get_data()
    print("processing data")
    data, indices = process_raw_data(raw_data)
    print("formatting data")
    data, vocab = format_data(*data)
    print("making embeddings")
    igor = Igor.from_file('embedding.conf')
    with open(igor.data_file, 'wb') as fp:
        pickle.dump(data, fp)
    vocab.save(path.join(igor.save_dir, igor.vocab_file))
    embeddings_from_vocab(igor, vocab)

And here are some of the important Igor parameters so far:

batch_size: 64
embedding_size: 300
rnn_size: 32
learning_rate: 0.0001
num_epochs: 100
### set during computation
vocab_size: 0
sequence_length: 0
### file stuff
data_file: data.pkl
vocab_file: words.vocab
embeddings_file: embedding.npy  # or /path/to/embedding.npy; set to ~ if none
checkpoint_filepath: cp_weights.h5

Be sure to read Part 3, where I outline a language model and discuss the modeling choices. I will also outline the algorithms needed to both decode from the language model and sample from it.

About the author

Brian McMahan is in his final year of graduate school at Rutgers University, completing a PhD in computer science and an MS in cognitive psychology.  He holds a BS in cognitive science from Minnesota State University, Mankato.  At Rutgers, Brian investigates how natural language and computer vision can be brought closer together with the aim of developing interactive machines that can coordinate in the real world.  His research uses machine learning models to derive flexible semantic representations of open-ended perceptual language.
