Diving into word embeddings: the Word2vec and ELMo algorithms

Until around 2013, words were treated as atomic units, meaning algorithms were unable to indicate any similarity between synonymous words, while humans intuitively notice that comparable words somehow “resemble” each other.

The Word2vec paper cites “forceful” and “strong” as an example. The Bag of Words algorithm cannot indicate any similarity between these words, although in many cases they can be used as drop-in replacements for each other. Neither can word representations based on counting word frequencies (TF-IDF). So Mikolov et al. came up with the idea described below, and their work yielded surprising results!
Let’s get to know the word embedding models, one by one!


What is word embedding and why is it so important?

Word embeddings stand for a group of language modeling and feature learning techniques in NLP in which words are represented by vectors of real numbers.


What is the ELMo model?

ELMo stands for Embeddings from Language Models. Another weird name. Another effort to find even more meaningful word vector embeddings. How? This time, the authors looked into recurrent neural networks and used them as embedding generators. In the ELMo model, the authors propose stacking multiple bi-directional LSTM layers, concatenating the forward and backward hidden representations within each layer, and, for downstream tasks, using a learned linear combination of the representations from all layers.
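That per-layer weighting step can be sketched in plain Python. This is a toy illustration, not the ELMo implementation: the vectors and weights below are made-up stand-ins for real biLSTM outputs and learned task-specific weights.

```python
import math

def softmax(xs):
    # Normalize raw task weights into a probability distribution.
    exps = [math.exp(x) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def elmo_combine(layer_vectors, raw_weights, gamma=1.0):
    """Linear combination of per-layer hidden vectors for one token.

    layer_vectors: one vector per layer (token embedding + biLSTM layers).
    raw_weights:   one scalar per layer, learned by the downstream task.
    gamma:         task-specific scaling factor.
    """
    s = softmax(raw_weights)
    dim = len(layer_vectors[0])
    combined = [0.0] * dim
    for weight, vec in zip(s, layer_vectors):
        for i in range(dim):
            combined[i] += weight * vec[i]
    return [gamma * c for c in combined]

# Toy example: three 2-dimensional "layer outputs" for a single token.
layers = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
# Equal raw weights give a plain average of the layers.
print(elmo_combine(layers, [0.0, 0.0, 0.0]))
```

In the real model, the raw weights and gamma are trained jointly with the downstream task, so each task picks its own mixture of the layers.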


What does the word vector mean?

It is easiest to picture a word vector as an analogy to an arrow in the 2D Cartesian coordinate system:

A vector is characterized by its direction and magnitude. We look for ways to represent words in n-dimensional vector spaces such that these vectors reflect what we humans know about these words. Words are alike? Go ahead, make the vectors point in a similar direction. One word is in the superlative form, while the other is not? Go ahead, change the magnitude of the vectors.
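The “similar direction” intuition is usually quantified with cosine similarity: the cosine of the angle between two vectors, which is 1 for vectors pointing the same way and negative for vectors pointing away from each other. A minimal sketch in plain Python, with hand-made 2-D vectors standing in for trained embeddings:

```python
import math

def cosine_similarity(u, v):
    # cos(theta) = (u . v) / (|u| * |v|)
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hypothetical 2-D "word vectors": the two similar words point roughly
# the same way, the unrelated word points elsewhere.
strong   = [0.9, 0.8]
forceful = [0.8, 0.9]
banana   = [-0.7, 0.2]

print(cosine_similarity(strong, forceful))  # close to 1
print(cosine_similarity(strong, banana))    # much lower
```

Real embeddings have hundreds of dimensions, but the computation is exactly the same.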


What is the Word2vec algorithm?

One of the first semantically successful algorithms (showing similarity between synonyms, not only between morphologically similar words) to represent words as dense, rather than sparse, vectors is Word2vec (2013). Depending on the corpus it was trained on, it can reflect semantic similarity between words. Thus, it can also be used to automate near-duplicate content detection.
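One way such detection could work, sketched here with a tiny invented embedding table (in practice the vectors would come from a trained Word2vec model): represent each text as the average of its word vectors, then compare texts with cosine similarity.

```python
import math

# Toy, hand-made "embeddings" standing in for trained Word2vec vectors.
EMBEDDINGS = {
    "strong":   [0.9, 0.8],
    "forceful": [0.8, 0.9],
    "coffee":   [-0.7, -0.2],
    "is":       [0.1, 0.1],
    "he":       [0.0, 0.2],
}

def doc_vector(words, dim=2):
    # Represent a text as the average of its word vectors.
    total = [0.0] * dim
    for w in words:
        vec = EMBEDDINGS.get(w, [0.0] * dim)
        for i in range(dim):
            total[i] += vec[i]
    return [t / len(words) for t in total]

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) *
                  math.sqrt(sum(b * b for b in v)))

a = doc_vector("he is strong".split())
b = doc_vector("he is forceful".split())
c = doc_vector("coffee is strong".split())
# The near-duplicate pair (a, b) scores higher than the unrelated pair.
print(cosine(a, b), cosine(a, c))
```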


But how precisely does Word2vec work?

Let’s abandon the black box approach and take a careful look under the hood! To start with, the Word2vec algorithm draws heavily on an idea published by Bengio et al. in a 2003 paper titled “A Neural Probabilistic Language Model”. The paper appeared well before the current AI boom and hence was rather overlooked. It introduced the notion of using the trained hidden layer of a densely connected neural network to obtain word embeddings.


The structure of Word2vec

Word2vec implements two ideas: the skip-gram model and the Continuous Bag of Words (CBOW) model. The former is concerned with predicting context words given a target word (i.e., for the word “word”, what is the probability that its context is “In the beginning was the <word>”), and the latter with predicting a target word from its bag-of-words context (the other way around from skip-gram): given the context “In the beginning was the <…>”, what is the probability that the missing word is “word”?
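The training pairs both models consume can be generated with a simple sliding window over the text. A sketch in plain Python (the window size of 2 is an arbitrary choice for illustration):

```python
def skipgram_pairs(tokens, window=2):
    """Generate (target, context) pairs as used to train skip-gram.

    For CBOW the same pairs are read the other way around: the set of
    context words is used to predict the target word.
    """
    pairs = []
    for i, target in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:  # a word is not its own context
                pairs.append((target, tokens[j]))
    return pairs

sentence = "in the beginning was the word".split()
for target, context in skipgram_pairs(sentence, window=2):
    print(target, "->", context)
```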

Word representations in Word2vec are taken from a simple neural network, which consists of:

  • an input layer,
  • a projection layer (a linear hidden layer),
  • an output layer.

The input layer is the one-hot encoded representation of a word (all zeros except one non-zero element). The projection layer, which serves as the hidden layer, is typically around 300 units wide and learns the dense representations; the weight matrix connecting the input to it therefore has dimensionality V x N (vocabulary size x embedding size). Notably, this hidden layer has no activation function. So what is the task that the neural network is learning to perform? Here come both models described earlier! Predicting the-context-given-the-word or the-word-given-the-context are the tasks this neural network learns to perform.
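Because the input is one-hot and the projection is linear, multiplying the input vector by the weight matrix simply selects one row of that matrix, and that row is the word’s embedding. A toy sketch with a 5-word vocabulary and 3 hidden units instead of 300 (all numbers made up):

```python
VOCAB = ["in", "the", "beginning", "was", "word"]

# Hypothetical V x N weight matrix (V = 5 words, N = 3 hidden units).
W = [
    [0.1, 0.2, 0.3],   # "in"
    [0.4, 0.5, 0.6],   # "the"
    [0.7, 0.8, 0.9],   # "beginning"
    [1.0, 1.1, 1.2],   # "was"
    [1.3, 1.4, 1.5],   # "word"
]

def one_hot(word):
    # Zeros everywhere except a single 1 at the word's index.
    return [1.0 if w == word else 0.0 for w in VOCAB]

def project(x, weights):
    # Linear layer, no activation: output[j] = sum_i x[i] * weights[i][j].
    n = len(weights[0])
    return [sum(x[i] * weights[i][j] for i in range(len(x))) for j in range(n)]

# Multiplying a one-hot vector by W is just a row lookup:
print(project(one_hot("was"), W))  # equals W[3], the embedding of "was"
```

This is why, after training, the weight matrix itself is kept as the embedding table and the rest of the network is thrown away.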

It is much faster to train the CBOW model than the skip-gram model, and there are also some smart tricks for speeding up training: hierarchical softmax and negative sampling.
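A sketch of the negative-sampling idea for a single skip-gram pair: push the sigmoid of the target-context dot product toward 1, and toward 0 for a few randomly sampled “negative” words (in the paper, negatives are drawn from the unigram distribution raised to the 3/4 power). All vectors below are toy values, not trained parameters:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def negative_sampling_loss(target_vec, context_vec, negative_vecs):
    """Loss for one skip-gram pair under negative sampling.

    Minimizing it maximizes log sigmoid(t . c) for the observed pair
    and log sigmoid(-t . n) for each sampled negative word n.
    """
    loss = -math.log(sigmoid(dot(target_vec, context_vec)))
    for n in negative_vecs:
        loss -= math.log(sigmoid(-dot(target_vec, n)))
    return loss

t = [0.5, 0.5]
good_context = [0.9, 0.9]                     # aligned with the target
negatives = [[-0.8, -0.4], [-0.2, -0.9]]      # point away from the target
print(negative_sampling_loss(t, good_context, negatives))
```

The key saving is that each update touches only the sampled words instead of computing a softmax over the entire vocabulary.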


How can you benefit from word embeddings in NLP?

To sum up, word embeddings help you make data-driven decisions. But what does that mean? Let’s break it down. Word embeddings make it easier to analyze your website’s content and statistics. For example, you could easily establish the ratio of positive to negative comments (sentiment analysis) on your social media profile or blog, or track trends in expert and market skills. Another advantage is creating automated executive summaries of your content. This way, you can save time and money by reducing your marketing specialist’s workload and allocating your budget to more effective marketing activities.