Jon: Hold on, Brandon. When we went over maths 101, didn’t you say numbers become vectors when they have a weight and direction attached to them? But numbers and words are apples and oranges! I don’t understand how words could also become vectors. Unless the words are coming from my movie director and he is yelling at me 🙂 … What would the point of words having directions be, anyway?
Dr. Brandon: Excellent question to kick off today’s topic, Jon. On an unrelated note, I am sure your director has his reasons. The following is an excerpt from the book Mastering Machine Learning with Spark 2.x by Alex Tellez, Max Pumperla and Michal Malohlava.
Traditional NLP approaches rely on converting individual words (which we created via tokenization) into a format that a computer algorithm can learn from (in our case, to predict movie sentiment). Doing this required us to convert a single review of N tokens into a fixed representation by creating a TF-IDF matrix. In doing so, we did two important things behind the scenes:
- Individual words were assigned an integer ID (for example, via a hash). The word friend might be assigned to 39,584, while the word bestie might be assigned to 99,928,472. Cognitively, we know that friend is very similar to bestie; however, any notion of similarity is lost by converting these tokens into integer IDs.
- By converting each token into an integer ID, we consequently lose the context in which the token was used. This is important because, in order to understand the cognitive meaning of words, and thereby train a computer to learn that friend and bestie are similar, we need to understand how the two tokens are used (that is, their respective contexts).
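The first point can be illustrated with a minimal sketch. The IDs for friend and bestie are the ones from the example above; the word fridge and its ID are hypothetical additions, included only to show that numerically close IDs say nothing about meaning.

```python
# Toy token-to-ID table. "friend" and "bestie" use the IDs from the text;
# "fridge" and its ID are a hypothetical addition for illustration.
token_ids = {"friend": 39_584, "bestie": 99_928_472, "fridge": 39_585}


def id_distance(a, b):
    """Absolute difference between two integer IDs (not a meaningful distance)."""
    return abs(token_ids[a] - token_ids[b])


# Synonyms end up far apart, while unrelated words can be adjacent.
print(id_distance("friend", "bestie"))
print(id_distance("friend", "fridge"))
```

Whatever arithmetic we apply to the IDs, the result reflects how the IDs were assigned, not how the words are used.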
Given this limited ability of traditional NLP techniques to encode the semantic and syntactic meaning of words, Tomas Mikolov and other researchers explored methods that employ neural networks to better encode the meaning of words as a vector of N numbers (for example, vector bestie = [0.574, 0.821, 0.756, … , 0.156]). When these vectors are calculated properly, we will find that the vectors for bestie and friend are close in space, where closeness is measured by cosine similarity. It turns out that these vector representations (often referred to as word embeddings) give us the ability to capture a richer understanding of text.
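Cosine similarity is the dot product of two vectors divided by the product of their lengths, so it ranges from -1 (opposite directions) to 1 (same direction). The following is a minimal sketch in plain Python; the three-dimensional vectors are made up for illustration and are not real word2vec output.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Hypothetical toy embeddings (assumed values, chosen by hand).
friend = [0.9, 0.8, 0.1]
bestie = [0.8, 0.9, 0.2]
popcorn = [-0.1, 0.2, 0.9]

print(cosine_similarity(friend, bestie))   # close to 1: similar direction
print(cosine_similarity(friend, popcorn))  # much smaller: different direction
```

Because the measure depends only on direction, two vectors of very different lengths can still count as similar.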
Interestingly, using word embeddings also gives us the ability to learn the same semantics across multiple languages despite differences in the written form (for example, Japanese and English). For example, the Japanese word for movie is eiga; with word vectors, these two words should be close in the vector space despite their differences in appearance. Thus, word embeddings allow applications to be language-agnostic, which is yet another reason why this technology is hugely popular!
First things first: word2vec does not represent a single algorithm but rather a family of algorithms that attempt to encode the semantic and syntactic meaning of words as a vector of N numbers (hence, word-to-vector = word2vec). We will explore each of these algorithms in depth in this chapter, while also giving you the opportunity to read/research other areas of vectorization of text, which you may find helpful.
What is a word vector?
In its simplest form, a word vector is merely a one-hot encoding, whereby every element in the vector represents a word in our vocabulary, and the given word is encoded with 1 while all the other elements are encoded with 0. Suppose our vocabulary only has the following movie terms: Popcorn, Candy, Soda, Tickets, and Blockbuster.
Following the logic we just explained, we could encode the term Tickets as the vector [0, 0, 0, 1, 0]: a 1 in the fourth position, which belongs to Tickets, and a 0 everywhere else.
Using this simplistic form of encoding, which is what we do when we create a bag-of-words matrix, there is no meaningful comparison we can make between words (for example, is Popcorn related to Soda; is Candy similar to Tickets?).
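The one-hot encoding described above can be sketched in a few lines. The helper names `one_hot` and `dot` are illustrative choices, not taken from the book.

```python
# The five-term vocabulary from the text, in the order given there.
vocabulary = ["Popcorn", "Candy", "Soda", "Tickets", "Blockbuster"]


def one_hot(word, vocab=vocabulary):
    """Return a vector with 1 at the word's position and 0 elsewhere."""
    return [1 if term == word else 0 for term in vocab]


def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


print(one_hot("Tickets"))                        # [0, 0, 0, 1, 0]
print(dot(one_hot("Popcorn"), one_hot("Soda")))  # 0
```

The dot product of any two different one-hot vectors is 0, so every pair of distinct words looks equally unrelated, which is exactly the limitation described here.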
Given these obvious limitations, word2vec attempts to remedy them via distributed representations of words. Suppose that each word in our vocabulary is represented by a vector of, say, 300 numbers, so that the meaning of a word is spread as a distribution of weights across those 300 elements rather than stored in a single position.
Now, our picture would drastically change to look something like this:
Now, given this distributed representation of individual words as 300 numeric values, we can make meaningful comparisons among words using, for example, cosine similarity. That is, using the vectors for Tickets and Soda, we can determine that the two terms are not related, given their vector representations and their cosine similarity to one another. And that’s not all we can do! In their ground-breaking paper, Mikolov et al. also performed arithmetic on word vectors and made some incredible findings; in particular, the authors posed the following math problem to their word2vec dictionary:
V(King) – V(Man) + V(Woman) ~ V(Queen)
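The arithmetic can be tried on toy data. The two-dimensional vectors below are hand-made assumptions (roughly a "royalty" axis and a "gender" axis), not learned embeddings, and leaving the three input words out of the search is an assumed convention of this sketch.

```python
import math

# Hypothetical 2-D embeddings chosen by hand for illustration.
vectors = {
    "king": [0.9, 0.8],
    "man": [0.1, 0.9],
    "woman": [0.1, -0.9],
    "queen": [0.9, -0.8],
    "popcorn": [-0.5, 0.1],
}


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def analogy(a, b, c):
    """Return the word whose vector is closest to V(a) - V(b) + V(c)."""
    target = [x - y + z for x, y, z in zip(vectors[a], vectors[b], vectors[c])]
    # Skip the input words so the answer is a different word.
    candidates = [w for w in vectors if w not in (a, b, c)]
    return max(candidates, key=lambda w: cosine_similarity(target, vectors[w]))


print(analogy("king", "man", "woman"))  # queen
```

With real embeddings the match is approximate, which is why the relation is written with ~ rather than an equals sign.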
It turns out that these distributed vector representations of words are extremely powerful for comparison questions (for example, is A related to B?), which is all the more remarkable when you consider that this learned semantic and syntactic knowledge comes from observing lots of words and their context, with no other information necessary. That is, we did not have to tell our machine that Popcorn is a food, a noun, singular, and so on.
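How is this made possible? Word2vec trains a neural network in a supervised fashion on a prediction task built from the raw text itself (for example, predicting a word from the words around it). Because the training targets come from the text and not from hand-made labels, learning the vector representation of words remains, overall, an unsupervised task.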
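The excerpt does not spell out how the training examples are built. One common construction, sketched here as an assumption in the style of the skip-gram model, slides a window over the tokens and pairs each word with its neighbors; the function name `skipgram_pairs` and the `window` parameter are illustrative.

```python
def skipgram_pairs(tokens, window=2):
    """Build (center, context) training pairs from a list of tokens.

    Each word is paired with every word at most `window` positions away.
    """
    pairs = []
    for i, center in enumerate(tokens):
        start = max(0, i - window)
        end = min(len(tokens), i + window + 1)
        for j in range(start, end):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs


tokens = "the movie was great".split()
for center, context in skipgram_pairs(tokens, window=1):
    print(center, "->", context)
```

Every pair acts as a labeled example (input word, word to predict), even though nobody labeled the text by hand.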
The above is an excerpt from the book Mastering Machine Learning with Spark 2.x by Alex Tellez, Max Pumperla and Michal Malohlava. To learn more about the word2vec and doc2vec models, including continuous bag-of-words (CBOW), skip-gram, and distributed memory, as well as cosine similarity, and to build applications based on them, check out the book.