NLP in SEO – a way to understand Google to improve your online visibility

Google’s tale of a word that is both a noun and a verb – or: how can I improve my company’s visibility by understanding how Google works, and what role does NLP (Natural Language Processing) play in this process?

 

How can I benefit from applying NLP to SEO?

Let’s think about some common cases where this is useful for something more than understanding the history of Natural Language Processing. Here are some ideas:

  1. automatic detection of a document’s topic
  2. sentiment classification
  3. classifying relations between entities
  4. paraphrase identification
  5. semantic role labeling
  6. question answering
  7. predicting box office revenues of movies based on critic reviews
  8. modelling interest in a text
  9. relation between character-sequences and part of speech tags

 

Let’s step into the world of NLP

Mankind has been massively analyzing text since print became widespread.
Machines started analyzing text some time later, in the 20th century, and now we are experiencing a boom in text analysis methods – the so-called Natural Language Processing.

What is NLP in SEO?

We all know Google. You may have first become familiar with it as a noun, a company name. That was a while ago, and almost seamlessly the word turned from a noun into a verb synonymous with search. That was also some time ago, because the way we construct queries for Google – the way we Google – has also changed dramatically. A couple of years back, we would have used a dictionary-like style of searching for information. But now we can type in a natural question, just as we would ask another human being, and get a specific answer to that question. How did that come about? Let’s see!

 

Sentiment analysis and a magic Bag-of-Words algorithm

Machines prefer to deal with numbers; hence, we need to assign a numerical representation to a word. Or a letter. Or a document.

The idea of assigning a unique number to each unique letter is really tempting. There are 26 letters in the Latin alphabet, so why not assign A – 1, B – 2, C – 3, etc. and rejoice over a problem solved? Let’s take the word “word”: W – 23, O – 15, R – 18, D – 4, which gives 2315184. But how do we know that it is not B C O R D (2, 3, 15, 18, 4)? We don’t.
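To make the ambiguity concrete, here is a minimal Python sketch of that naive letter-to-number scheme (the mapping and the two example spellings are the ones from the paragraph above):

```python
# Naive scheme: A = 1, B = 2, ..., Z = 26, digits simply concatenated.
letter_to_num = {chr(ord("A") + i): i + 1 for i in range(26)}

def encode(word):
    """Concatenate each letter's number into one digit string."""
    return "".join(str(letter_to_num[ch]) for ch in word.upper())

print(encode("WORD"))   # '2315184'
print(encode("BCORD"))  # '2315184' as well – the encoding is ambiguous
```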

There are many words in a language. In 2010, Harvard researchers estimated around 1 million words in English, and the second edition of the Oxford English Dictionary lists about 600,000 words, of which 171,476 are in current use. An average college-educated English native speaker knows between 20k and 30k words. That is quite a few, but it is doable. The NLP approach to this is the so-called Bag of Words, where we assign a unique index to each word in the corpus (the set of all words being analyzed). For example, suppose we want to classify the sentiment of comments (positive vs. negative) and we are given two comments, as in the sketch below.
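Since the original illustration with the two comments is not reproduced here, the following is a minimal sketch of the idea, using one comment borrowed from the n-gram example further down and one made-up negative comment; scikit-learn’s CountVectorizer is just one convenient way to build such a bag-of-words representation (an assumed choice, not something this article prescribes):

```python
from sklearn.feature_extraction.text import CountVectorizer

# Two example comments: the first matches the 2-gram example later in the
# article, the second is a hypothetical negative comment.
comments = [
    "This movie is brilliant",            # positive
    "I fell asleep on it and regret it",  # negative (made up)
]

# Every unique word in the corpus gets an index; each comment becomes a
# vector of word counts at those indices.
vectorizer = CountVectorizer()
bag_of_words = vectorizer.fit_transform(comments)

print(vectorizer.get_feature_names_out())
print(bag_of_words.toarray())
```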

So far, so good, but we have lost the order of words. Is “good movie, no regrets!” the same as “not a good movie, I regret it”? Totally not. Besides, it is now clearly visible that some words carry no meaning and can be ignored – they are called stop-words. In this case they are [this, is, I, on, it, and]. We have also arrived at the concept of a token – a useful semantic unit for processing, usually a word. In English (unlike in Chinese or Thai) it is easy to tokenize a document – split words on whitespace and remove punctuation.

The bigger the corpus, the larger the Bag of Words vector becomes, and handling sparse vectors (many zeros – for example, a short sentence over a large corpus) requires a lot of memory. One technique to deal with this issue is lemmatization – reducing a word to its root form (playing – play). That way we discard the variations arising from context and word forms while preserving the vast majority (if not all) of the information contained in the sentence. What might look childish to humans does the job for the machine.
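Tokenization, stop-word removal, and lemmatization could look roughly like this; spaCy and its small English model are an assumed choice here, since the article does not name any particular library:

```python
import spacy

# Assumes the small English pipeline is installed:
#   python -m spacy download en_core_web_sm
nlp = spacy.load("en_core_web_sm")

doc = nlp("I was playing games and I regret it")

# Tokenize, drop stop-words and punctuation, and lemmatize what is left
# (e.g. "playing" -> "play").
tokens = [tok.lemma_ for tok in doc if not tok.is_stop and not tok.is_punct]
print(tokens)
```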

Bag of Words was used for email spam detection for a while, but it has the aforementioned problems, which can be addressed with the following methods.

 

Sentiment analysis – lights and shadows of N-grams and TF-IDF

One solution is using n-grams – tokens consisting of n consecutive words. In this way, considering our corpus, we would have the following 2-grams:
[this movie, movie is, is brilliant] for the first line and a similar set for the second.
Some n-grams occur often, for example “Hong Kong” or “San Francisco”, while others do not, for example “is brilliant”. It is hard to find the right n. Also, we are only interested in medium-frequency n-grams (high frequency – stop words, low frequency – typos). But there are many medium-frequency n-grams, hence the idea of ranking them: an n-gram with a lower frequency can be more discriminating because it can capture a specific issue in the review. The weighting scheme that captures this is called TF-IDF (term frequency – inverse document frequency). Once a text has been preprocessed in this way, it is easy to use a classifier for further tasks. The aforementioned methods, however, do not take into account any similarity between the meanings of words – the kind of similarity that could make your content appear when your keywords are close in meaning to the user’s search terms. For example, “great” and “wonderful” are more similar to each other than “great” and “yacht”.
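Here is what the TF-IDF-over-n-grams idea could look like in practice, using the two comments from earlier in the article; scikit-learn’s TfidfVectorizer is again an assumed choice:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

comments = [
    "good movie, no regrets!",
    "not a good movie, I regret it",
]

# Build unigrams and 2-grams, then weight them by TF-IDF so that terms
# frequent in one comment but rare across the corpus score higher.
vectorizer = TfidfVectorizer(ngram_range=(1, 2))
tfidf = vectorizer.fit_transform(comments)

print(vectorizer.get_feature_names_out())
print(tfidf.toarray().round(2))
```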

 

Sentiment analysis algorithm accuracy and process

Let’s now take a closer look into some more recent algorithms:

One of the most noticeable differences is representing each feature (word) as a dense vector (with no zeros) rather than a one-hot representation (a sparse vector that is all zeros except for a single 1 at the word’s unique index), which solves the problem mentioned above. With dense representations, words with similar meanings are “closer” to each other, while in one-hot encoding there is no noticeable similarity between words at all.

One way to think of similarity is with a 2D Cartesian coordinate system: words with similar meanings should have vectors pointing in similar directions, and this holds true for more dimensions (there are just more axes). For illustration, here’s a 2D picture of a vector.

 

Illustration 1: Vector in 2D
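A minimal numerical sketch of that similarity idea, measured as the cosine of the angle between vectors (the 2D values below are made up purely for illustration):

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1.0 means the same direction."""
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# One-hot vectors: every pair of distinct words is equally (un)related.
great_onehot     = np.array([1, 0, 0])
wonderful_onehot = np.array([0, 1, 0])
print(cosine_similarity(great_onehot, wonderful_onehot))  # 0.0

# Toy dense 2D vectors (made-up values).
great     = np.array([0.90, 0.80])
wonderful = np.array([0.85, 0.75])
yacht     = np.array([-0.30, 0.90])
print(cosine_similarity(great, wonderful))  # close to 1
print(cosine_similarity(great, yacht))      # much lower
```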

 

Algorithms that encode semantic meaning into vector representations of words include Word2vec (2013), Doc2Vec (2014), GloVe (2014), and FastText (2016).

 

Google’s Word2Vec model in Google language processing

The Word2vec algorithm finds a fixed-size vector for each word, which becomes problematic when comparing sentences with varying numbers of words – as is usually the case – because the feature matrices need to be exactly the same size for the math to work. There are two main approaches to this problem. One is to “pad” all feature matrices (each representing a sentence) to the length of the longest one, but then one usually ends up with many zeros that add absolutely no value to the math. Another way of dealing with sentences of different lengths is to average all their word vectors (resulting in a centroid in the vector space – the average of all the words), as illustrated below:

Illustration 2: Centroid
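As a rough sketch of that centroid idea (the toy word vectors below are made up – real word2vec vectors would have hundreds of dimensions):

```python
import numpy as np

# Toy 2D word vectors (made-up values for illustration).
word_vectors = {
    "good":    np.array([0.8, 0.1]),
    "movie":   np.array([0.2, 0.9]),
    "no":      np.array([-0.5, 0.3]),
    "regrets": np.array([0.6, -0.4]),
}

sentence = ["good", "movie", "no", "regrets"]

# The centroid: averaging the word vectors gives one fixed-size vector per
# sentence, regardless of how many words the sentence contains.
centroid = np.mean([word_vectors[w] for w in sentence], axis=0)
print(centroid)
```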

 

For long sentences, this results in a lot of lost meaning. Word2vec (Google) has another problem – it needs to be “trained” on some vocabulary, and it only works for words that were present during “training”. So if someone makes a typo in a comment, word2vec doesn’t recognize the word and raises a so-called out-of-vocabulary (OOV) error.

 

How to handle typos occurring during NLP training? Doc2Vec and other algorithms

One approach to solving the varying-dimensionality problem is the Doc2Vec (Google) algorithm, which looks at an entire sentence and, regardless of the number of words, produces a fixed-size feature vector for it – which, according to the publication that introduced Doc2Vec, yields better results than averaging word2vec vectors.
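A minimal sketch of how that could look with gensim’s Doc2Vec implementation (gensim is an assumed choice here; the toy corpus is made up):

```python
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

# Tiny toy corpus; each sentence gets a tag so the model learns a vector
# for the whole document, not only for individual words.
corpus = [
    TaggedDocument(words=["good", "movie", "no", "regrets"], tags=[0]),
    TaggedDocument(words=["not", "a", "good", "movie"], tags=[1]),
]

model = Doc2Vec(corpus, vector_size=16, min_count=1, epochs=40)

# A fixed-size vector for a new sentence, whatever its length.
vector = model.infer_vector(["a", "brilliant", "movie"])
print(vector.shape)  # (16,)
```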

Another algorithm that handles OOV errors is FastText (Facebook). FastText does not treat each word as a single indivisible entity (in contrast to word2vec); instead, it breaks each word into character n-grams (subword units) and can therefore handle typos (for example, producing very similar vectors for the words “words” and “wordf”). Another algorithm for word embeddings is GloVe (Global Vectors for Word Representation) by Stanford.
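A minimal sketch of that behaviour with gensim’s FastText implementation (again an assumed choice, trained here on a made-up toy corpus):

```python
from gensim.models import FastText

sentences = [
    ["good", "movie", "no", "regrets"],
    ["not", "a", "good", "movie", "many", "words"],
]

# FastText composes word vectors from character n-grams, so it can build a
# vector even for a misspelled, out-of-vocabulary token.
model = FastText(sentences, vector_size=16, min_count=1, epochs=20)

print(model.wv.similarity("words", "wordf"))  # typically high, despite the typo
```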

Currently, BERT (Google) and XLNet (Google) are considered SOTA (state of the art). What do these algorithms do?

 

BERT – the hero of the Google NLP story?

Most of us have come across this mysterious four-letter name, which may be a holy grail to understanding Google and increasing revenue. But do you know why that is and what’s under the hood? Let’s explore this mysterious model together.
BERT beat the previous state-of-the-art results on benchmarks such as GLUE, MultiNLI, SQuAD v1.1, and SQuAD v2.0. What do these benchmarks measure?
GLUE (designed by NYU) – General Language Understanding Evaluation asks:

  1. is the sentence grammatical?
  2. is the movie review positive/neutral/negative?
  3. is sentence A a paraphrase of sentence B?
  4. how similar are the two sentences?
  5. does sentence A entail or contradict B?
  6. does B contain the answer to the question in A?
  7. does A entail B?
  8. are the two questions similar?
  9. when sentence B replaces sentence A’s ambiguous pronoun – is this the correct noun?

MultiNLI – a big corpus for sentence understanding through inference

SQuAD – the Stanford Question Answering Dataset

BERT originates from the Transformer architecture, introduced by Google in 2017 and taking advantage of the “attention” mechanism, which beat the then state-of-the-art recurrent neural networks (and their combinations with convolutional neural networks) in far less training time. It was also significantly less compute-intensive (far fewer days of GPU power required to “train” the model), but that is a long story and we’ve described it more fully in another article about Google’s BERT update. BERT comes with “pre-trained” weights – that means that somebody has taken a very large corpus and found the proper parameters for all the “layers” within the BERT model.
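Loading those pre-trained weights could look roughly like this; the Hugging Face transformers library and the “bert-base-uncased” checkpoint are assumed choices, not something this article prescribes:

```python
import torch
from transformers import AutoTokenizer, AutoModel

# Download the pre-trained tokenizer and weights published for BERT.
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

inputs = tokenizer("not a good movie, I regret it", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# One contextual vector per token: the same word gets a different vector in
# different sentences, unlike the static embeddings discussed above.
print(outputs.last_hidden_state.shape)  # (1, number_of_tokens, 768)
```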

 

Google Natural Language Processing results – what is important for your business?

Abracadabra!

Natural Language Processing has evolved rapidly in recent years. We’ve described how SEO activities might benefit from Natural Language Processing, focusing mainly on user intent in search. Applying NLP methods to your SEO strategy can support data-driven content creation and data-driven decision making – and from there, it is a short step to the hearts, minds, and wallets of users.
If you feel like we could help you grow your business, do not hesitate to get in touch with us!