Vector Representation of Words

In natural language processing (NLP) tasks, the first step is to represent a document as an element of a vector space. Standard machine learning and data mining algorithms expect each data instance as a vector; in fact, when we say data, we usually mean a matrix with a row (vector) for each data point. There are various ways to express a textual document as a vector, depending on the problem and the assumptions of the model. In traditional Vector Space Models (VSMs), a word is represented as a vector whose dimension equals the size of the vocabulary, with each word in the vocabulary corresponding to one entry in the vector. For example, if the text is “Friends work and play together”, then our vocabulary has 5 words, and we can represent the words as:

    Friends = [1,0,0,0,0]
    Work = [0,1,0,0,0]
    And = [0,0,1,0,0]
    Play = [0,0,0,1,0]
    Together = [0,0,0,0,1]

Such a representation, called one-hot encoding, is useful because these encodings can be combined into a vector representation of an entire textual document, which is central to modern search engines. The vectors above are binary, but other encodings are possible, such as using word frequencies or some variant thereof; if you are curious, read about TF-IDF. This type of representation has obvious limitations. Most importantly, it treats a word as atomic and provides no information about the relationships that may exist between individual words; we can’t perform any meaningful comparison between words other than equality. Furthermore, such a representation results in word vectors that are extremely sparse.
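
To make this concrete, here is a small Python sketch (purely illustrative, not part of any library) that builds one-hot vectors for our toy vocabulary:

```python
# Minimal sketch: one-hot encoding for a toy vocabulary (illustrative only).
text = "Friends work and play together"
vocabulary = text.lower().split()  # ['friends', 'work', 'and', 'play', 'together']

def one_hot(word, vocabulary):
    """Return a binary vector with a 1 at the word's index in the vocabulary."""
    vector = [0] * len(vocabulary)
    vector[vocabulary.index(word)] = 1
    return vector

print(one_hot("play", vocabulary))  # [0, 0, 0, 1, 0]
```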

Distributed Representations

To overcome some of the limitations of the one-hot scheme, the distributional hypothesis is adopted: words that appear in the same contexts are semantically closer than words that do not share those contexts. Using this principle, words can be represented as points in a continuous vector space, where semantically similar words correspond to nearby points. This representation is also called word embeddings, since we are embedding word vectors in a distributed vector space. Essentially, the weight of each word is distributed across many dimensions: instead of a one-to-one mapping between a word and a basis vector (dimension), each word’s contribution is spread across all of the dimensions of the vector. The dimensions are believed to capture semantic properties of the words. For example, for our text “Friends work and play together”, each word could be represented as something like:

    Friends = [0.73,0.34,0.52,0.01]
    Work = [0.65,0.79,0.22,0.1]
    And = [0.87,0.94,0.14,0.7]
    Play = [0.73, 0.69, 0.89, 0.4]
    Together = [0.87,0.79,0.22,0.09]

You can see that the words ‘Friends’ and ‘Together’ are closer to each other, and the words ‘Work’ and ‘Play’ also have high similarity. Note that these vectors are chosen arbitrarily and do not reflect an actual learned representation; their sole purpose is to illustrate the idea.
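
Closeness between word vectors is usually measured with cosine similarity. Here is a small sketch that computes it for the toy vectors above (keep in mind the numbers are made up, so the scores carry no real meaning):

```python
import math

# The toy 4-dimensional vectors from the example above (made up, not learned).
embeddings = {
    "friends":  [0.73, 0.34, 0.52, 0.01],
    "work":     [0.65, 0.79, 0.22, 0.10],
    "and":      [0.87, 0.94, 0.14, 0.70],
    "play":     [0.73, 0.69, 0.89, 0.40],
    "together": [0.87, 0.79, 0.22, 0.09],
}

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors; closer to 1 means more similar."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

print(cosine_similarity(embeddings["friends"], embeddings["together"]))
print(cosine_similarity(embeddings["work"], embeddings["play"]))
```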

Learning Distributed Representations: Word2Vec

Distributed representations of words can be learned by training a model on a corpus of textual data. Mikolov et al. proposed an efficient method to learn these embeddings, making it feasible to learn high-quality word vectors from a huge corpus of data. The model is called word2vec, and it uses a neural network to learn the representations.

Two neural network architectures were proposed – Continuous Bag-of-Words (CBOW) and Skip-Gram. CBOW predicts the current word from a window of neighboring words, while skip-gram uses the current word to predict the surrounding words.

Word2Vec models use a sliding window to quantify context. Each window contains a central word under attention, along with a few words preceding and following it. One of the important parameters of a word2vec model is the length of this sliding window. Consider a textual document:

“The goal of the manifold learning techniques is to ‘learn’ the low dimensional manifold.”

If we consider 4 words preceding and following the central word, the context of ‘manifold’ is: [‘the’, ‘goal’, ‘of’, ‘the’, ‘learning’, ‘techniques’, ‘is’, ‘to’].
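
Here is an illustrative sketch of how such a context window can be extracted; the sentence and window size follow the example above:

```python
# Sketch: extract the context window around a central word (4 words on each side).
sentence = ("The goal of the manifold learning techniques is to "
            "learn the low dimensional manifold")
tokens = sentence.lower().split()

def context_window(tokens, center_index, window=4):
    """Return up to `window` tokens on each side of the central word."""
    left = tokens[max(0, center_index - window):center_index]
    right = tokens[center_index + 1:center_index + 1 + window]
    return left + right

center = tokens.index("manifold")  # first occurrence of 'manifold'
print(context_window(tokens, center))
# ['the', 'goal', 'of', 'the', 'learning', 'techniques', 'is', 'to']
```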

These context words form the input layer of the neural network, with each word represented as a vector using the one-hot scheme. There is one hidden layer and one output layer. The output layer is formed by the central words, that is, each element in the vocabulary. This way, we learn a representation for each word in terms of its context words. The actual ordering of the context words is irrelevant, which is called the bag-of-words assumption.

The skip-gram method is the mirror image of CBOW: here the central word forms the input layer, and the context words are at the output layer. CBOW is faster to train, but skip-gram does a better job for infrequent words.
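
To make the contrast concrete, here is an illustrative sketch (not the reference implementation) of the training examples the two architectures would draw from the same sliding window:

```python
# Sketch: the training examples each architecture derives from one sliding window.
tokens = ["the", "goal", "of", "the", "manifold", "learning", "techniques", "is", "to"]
window = 2  # two words on each side, to keep the printout short

for i, center in enumerate(tokens):
    context = tokens[max(0, i - window):i] + tokens[i + 1:i + 1 + window]
    cbow_example = (context, center)                     # CBOW: context -> central word
    skipgram_examples = [(center, c) for c in context]   # skip-gram: central word -> each context word
    print(cbow_example, skipgram_examples)
```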

The implementation of both architectures can be found at Google Code. Google has also released pre-trained word vectors trained on about 100 billion words from the Google News dataset. Several other corpora, such as Wikipedia, have also been used to compute word vectors.

Modern neural network packages like TensorFlow also have word2vec support; refer to TensorFlow's word2vec tutorial.
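
If you would rather not build the network yourself, the gensim library (not covered in this post) provides both architectures behind a simple API. Here is a minimal sketch, assuming gensim 4.x is installed:

```python
# Minimal sketch using gensim (assumes gensim >= 4.0 is installed).
from gensim.models import Word2Vec

# A toy corpus: a list of tokenized sentences.
corpus = [
    ["friends", "work", "and", "play", "together"],
    ["the", "goal", "of", "manifold", "learning", "is", "to",
     "learn", "the", "low", "dimensional", "manifold"],
]

# sg=1 selects skip-gram; sg=0 (the default) selects CBOW.
model = Word2Vec(corpus, vector_size=50, window=4, min_count=1, sg=1)

print(model.wv["manifold"])           # the learned 50-dimensional vector
print(model.wv.most_similar("play"))  # nearest words by cosine similarity
```

The `sg` flag switches between the two architectures, and the other parameters (window length, vector size, minimum word frequency) map directly onto the notions discussed above.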

It is important to understand that word2vec is not deep learning; in fact, both the CBOW and skip-gram architectures are shallow neural models with only one hidden layer.

Applications

Distributed representations of words have been successfully applied in many applications. Machine translation, for example, has been shown to achieve much higher accuracy using distributed representations. With trained word vectors, you can make assertions such as the following (a code sketch of these queries follows the list):

- ```Distance(France, Germany) < Distance(France, Spain)```

- ```Vector('Paris') - Vector('France') + Vector('Italy') ~ Vector('Rome')```

- ```Vector('king') - Vector('man') + Vector('woman') ~ Vector('queen')```

- The odd one out in [staple, hammer, saw, drill] is staple.

- Item2vec: word2vec for collaborative filtering and recommendation systems, so you can infer ```Vector('David Guetta') - Vector('Avicii') + Vector('Beyonce') ~ Vector('Rihanna')```

- BioVectors: word vectors for bioinformatics. BioVectors can characterize biological sequences in terms of biochemical and biophysical interpretations of the underlying patterns.
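
Here is a sketch of how the word-analogy queries above look in code, using gensim's interface to the pre-trained Google News vectors mentioned earlier (the file name is the standard one for that download, but the path must point to your local copy):

```python
# Sketch: analogy and distance queries with pre-trained vectors via gensim.
from gensim.models import KeyedVectors

# Path is hypothetical; point it at your local copy of the Google News vectors.
wv = KeyedVectors.load_word2vec_format("GoogleNews-vectors-negative300.bin", binary=True)

# Vector('king') - Vector('man') + Vector('woman') ~ Vector('queen')
print(wv.most_similar(positive=["king", "woman"], negative=["man"], topn=1))

# Distance comparison and odd-one-out
print(wv.distance("France", "Germany") < wv.distance("France", "Spain"))
print(wv.doesnt_match(["staple", "hammer", "saw", "drill"]))
```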
