Vector Representation of Words

In natural language processing (NLP) tasks, the first step is to represent a document as an element of a vector space. Standard machine learning and data mining algorithms expect each data instance as a vector; in fact, when we say data, we usually mean a matrix with a row (vector) for each data point. There are various ways to express a textual document as a vector, depending on the problem and the assumptions of the model. In traditional Vector Space Models (VSMs), a word is represented as a vector whose dimension equals the size of the vocabulary, with each word in the vocabulary corresponding to one entry in the vector. For example, if the text is “Friends work and play together”, then our vocabulary has 5 words, and we can represent the words as:

    Friends = [1,0,0,0,0]
    Work = [0,1,0,0,0]
    And = [0,0,1,0,0]
    Play = [0,0,0,1,0]
    Together = [0,0,0,0,1]

Such a representation, called one-hot encoding, is useful because these encodings can be combined into a vector representation of an entire textual document, which is central to modern search engines. The vectors above are binary, but other encodings are possible, such as using word frequencies or some variant thereof; if you are curious, read about TF-IDF. This type of representation has obvious limitations. Most importantly, it treats a word as atomic and provides no information about the relationships that may exist between individual words; we can’t perform any meaningful comparison between words other than equality. Furthermore, such a representation results in word vectors that are extremely sparse.
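
To make this concrete, here is a small Python sketch (purely illustrative, not part of any library) that builds one-hot vectors for our toy vocabulary:

```python
# Minimal sketch: one-hot encoding for a toy vocabulary (illustrative only).
text = "Friends work and play together"
vocabulary = text.lower().split()  # ['friends', 'work', 'and', 'play', 'together']

def one_hot(word, vocabulary):
    """Return a binary vector with a 1 at the word's index in the vocabulary."""
    vector = [0] * len(vocabulary)
    vector[vocabulary.index(word)] = 1
    return vector

print(one_hot("play", vocabulary))  # [0, 0, 0, 1, 0]
```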

Distributed Representations

To overcome some of the limitations of the one-hot scheme, the distributional hypothesis is adopted: words that appear in the same contexts are semantically closer than words that do not share those contexts. Using this principle, words can be represented as points in a continuous vector space, where semantically similar words correspond to nearby points. This representation is also called word embeddings, since we are embedding word vectors in a distributed vector space. Essentially, the weight of each word is distributed across many dimensions: instead of a one-to-one mapping between a word and a basis vector (dimension), each word’s contribution is spread across all of the dimensions of the vector. The dimensions are believed to capture semantic properties of the words. For example, for our text “Friends work and play together”, each word could be represented as something like:

    Friends = [0.73,0.34,0.52,0.01]
    Work = [0.65,0.79,0.22,0.1]
    And = [0.87,0.94,0.14,0.7]
    Play = [0.73, 0.69, 0.89, 0.4]
    Together = [0.87,0.79,0.22,0.09]

You can see that the words ‘Friends’ and ‘Together’ are closer to each other, and the words ‘Work’ and ‘Play’ also have high similarity. Note that these vectors are chosen arbitrarily and do not reflect an actual learned representation; their sole purpose is to illustrate the idea.
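
Closeness between word vectors is usually measured with cosine similarity. Here is a small sketch that computes it for the toy vectors above (keep in mind the numbers are made up, so the scores carry no real meaning):

```python
import math

# The toy 4-dimensional vectors from the example above (made up, not learned).
embeddings = {
    "friends":  [0.73, 0.34, 0.52, 0.01],
    "work":     [0.65, 0.79, 0.22, 0.10],
    "and":      [0.87, 0.94, 0.14, 0.70],
    "play":     [0.73, 0.69, 0.89, 0.40],
    "together": [0.87, 0.79, 0.22, 0.09],
}

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors; closer to 1 means more similar."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

print(cosine_similarity(embeddings["friends"], embeddings["together"]))
print(cosine_similarity(embeddings["work"], embeddings["play"]))
```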

Learning Distributed Representations: Word2Vec

Distributed representations of words can be learned by training a model on a corpus of textual data. Mikolov et al. proposed an efficient method to learn these embeddings, making it feasible to learn high-quality word vectors from a huge corpus of data. The model is called word2vec, and it uses a neural network to learn the representations.

Two neural network architectures were proposed – Continuous Bag-of-Words (CBOW) and Skip-Gram. CBOW predicts the current word from a window of neighboring words, while skip-gram uses the current word to predict the surrounding words.

Word2Vec models use a sliding window to quantify context. Each window contains a central word under attention, along with a few words preceding and following it. One of the important parameters of a word2vec model is the length of this sliding window. Consider a textual document:

“The goal of the manifold learning techniques is to ‘learn’ the low dimensional manifold.”

If we consider 4 words preceding and following the central word, the context of ‘manifold’ is: [‘the’, ‘goal’, ‘of’, ‘the’, ‘learning’, ‘techniques’, ‘is’, ‘to’].
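
Here is an illustrative sketch of how such a context window can be extracted; the sentence and window size follow the example above:

```python
# Sketch: extract the context window around a central word (4 words on each side).
sentence = ("The goal of the manifold learning techniques is to "
            "learn the low dimensional manifold")
tokens = sentence.lower().split()

def context_window(tokens, center_index, window=4):
    """Return up to `window` tokens on each side of the central word."""
    left = tokens[max(0, center_index - window):center_index]
    right = tokens[center_index + 1:center_index + 1 + window]
    return left + right

center = tokens.index("manifold")  # first occurrence of 'manifold'
print(context_window(tokens, center))
# ['the', 'goal', 'of', 'the', 'learning', 'techniques', 'is', 'to']
```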

These context words form the input layer of the neural network, with each word represented as a vector using the one-hot scheme. There is one hidden layer and one output layer. The output layer is formed by the central words, that is, each element in the vocabulary. This way, we learn a representation for each word in terms of its context words. The actual ordering of the context words is irrelevant, which is called the bag-of-words assumption.

The skip-gram method is the mirror image of CBOW: here the central word forms the input layer, and the context words are at the output layer. CBOW is faster to train, but skip-gram does a better job for infrequent words.
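
To make the contrast concrete, here is an illustrative sketch (not the reference implementation) of the training examples the two architectures would draw from the same sliding window:

```python
# Sketch: the training examples each architecture derives from one sliding window.
tokens = ["the", "goal", "of", "the", "manifold", "learning", "techniques", "is", "to"]
window = 2  # two words on each side, to keep the printout short

for i, center in enumerate(tokens):
    context = tokens[max(0, i - window):i] + tokens[i + 1:i + 1 + window]
    cbow_example = (context, center)                     # CBOW: context -> central word
    skipgram_examples = [(center, c) for c in context]   # skip-gram: central word -> each context word
    print(cbow_example, skipgram_examples)
```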

The implementation of both architectures can be found at Google Code. Google has also released pre-trained word vectors trained on about 100 billion words from the Google News dataset. Several other corpora, such as Wikipedia, have also been used to compute word vectors.

Modern neural network packages like TensorFlow also have word2vec support; refer to TensorFlow's word2vec tutorial.
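
If you would rather not build the network yourself, the gensim library (not covered in this post) provides both architectures behind a simple API. Here is a minimal sketch, assuming gensim 4.x is installed:

```python
# Minimal sketch using gensim (assumes gensim >= 4.0 is installed).
from gensim.models import Word2Vec

# A toy corpus: a list of tokenized sentences.
corpus = [
    ["friends", "work", "and", "play", "together"],
    ["the", "goal", "of", "manifold", "learning", "is", "to",
     "learn", "the", "low", "dimensional", "manifold"],
]

# sg=1 selects skip-gram; sg=0 (the default) selects CBOW.
model = Word2Vec(corpus, vector_size=50, window=4, min_count=1, sg=1)

print(model.wv["manifold"])           # the learned 50-dimensional vector
print(model.wv.most_similar("play"))  # nearest words by cosine similarity
```

The `sg` flag switches between the two architectures, and the other parameters (window length, vector size, minimum word frequency) map directly onto the notions discussed above.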

It is important to understand that word2vec is not deep learning; in fact, both the CBOW and skip-gram architectures are shallow neural models with only one hidden layer.

Applications

Distributed representations of words have been successfully applied in many applications. Machine translation, for example, has been shown to achieve much higher accuracy using distributed representations. With trained word vectors, you can make assertions such as the following (a code sketch of these queries follows the list):

- ```Distance(France, Germany) < Distance(France, Spain)```

- ```Vector('Paris') - Vector('France') + Vector('Italy') ~ Vector('Rome')```

- ```Vector('king') - Vector('man') + Vector('woman') ~ Vector('queen')```

- The odd one out in [staple, hammer, saw, drill] is staple.

- Item2vec: word2vec for collaborative filtering and recommendation systems, so you can infer ```Vector('David Guetta') - Vector('Avicii') + Vector('Beyonce') ~ Vector('Rihanna')```

- BioVectors: word vectors for bioinformatics. BioVectors can characterize biological sequences in terms of biochemical and biophysical interpretations of the underlying patterns.
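
Here is a sketch of how the word-analogy queries above look in code, using gensim's interface to the pre-trained Google News vectors mentioned earlier (the file name is the standard one for that download, but the path must point to your local copy):

```python
# Sketch: analogy and distance queries with pre-trained vectors via gensim.
from gensim.models import KeyedVectors

# Path is hypothetical; point it at your local copy of the Google News vectors.
wv = KeyedVectors.load_word2vec_format("GoogleNews-vectors-negative300.bin", binary=True)

# Vector('king') - Vector('man') + Vector('woman') ~ Vector('queen')
print(wv.most_similar(positive=["king", "woman"], negative=["man"], topn=1))

# Distance comparison and odd-one-out
print(wv.distance("France", "Germany") < wv.distance("France", "Spain"))
print(wv.doesnt_match(["staple", "hammer", "saw", "drill"]))
```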
