Google’s BERT Update – How is NLP changing SEO in 2020?

BERT – the 2018 state-of-the-art language model for NLP. It was and still is widely discussed in SEO because search engines use it to understand users' queries. With the recent algorithm update, Google has taken one more step towards users, focusing on the quality of content rather than its quantity.

BERT was not a breakthrough in SEO. The turning point will be a change in user behavior.

– Agnieszka Zawadzka, Head of Research and Analysis

 

But what is BERT?

Let’s delve deeper! To really understand BERT you need to understand the concepts that led to its development; otherwise you may as well read the original BERT paper without really grasping what’s going on. What are the prerequisites for understanding BERT? BERT builds on top of a lot of things, among them Recurrent Neural Networks, LSTMs, the Encoder-Decoder architecture, Bi-LSTMs, Attention layers and Transformers. That seems daunting, but it is actually totally feasible. Let’s get going!

In BERT training there are 2 phases:

  1. pre-training
  2. fine-tuning

The pre-training and fine-tuning stages for BERT

During pre-training, the model is trained on unlabeled data with different pre-training tasks.
During fine-tuning, BERT is initialized with the pre-trained parameters, and all parameters are then fine-tuned on labeled data from a downstream task. Each downstream task gets its own fine-tuned model, but all of them share the same architecture, which is a distinctive feature of BERT. A downstream task can be any NLP task, for example one of the GLUE benchmark tasks.
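To make the two phases concrete, here is a minimal sketch of the fine-tuning setup. It assumes the Hugging Face transformers library and PyTorch (my tooling choice for illustration, not something prescribed by the BERT paper) and a toy sentiment task:

```python
# A minimal sketch of the fine-tuning phase, assuming the Hugging Face
# "transformers" library and PyTorch; the task and examples are illustrative only.
import torch
from transformers import BertTokenizer, BertForSequenceClassification

# Initialize from the pre-trained parameters (the output of phase 1)...
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=2  # e.g. positive / negative reviews
)

# ...then fine-tune all parameters on labeled downstream data.
batch = tokenizer(["a great movie", "a terrible movie"],
                  padding=True, return_tensors="pt")
labels = torch.tensor([1, 0])

outputs = model(**batch, labels=labels)
outputs.loss.backward()  # in practice this runs inside a full training loop
```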

BERT example – how do BERT and NLP work in practice?

Okay, it might not yet be clear what is meant by a downstream task, so here’s an example. Just as I was typing this sentence I wrote “down”… and Google suggested “downstream task”. That is next-word prediction, given the context.


Behind the scenes, somebody took some n-gram, masked the last word and trained the model to minimize a loss function for predicting that masked word. That is just one example of a task; it could be something else, like classifying the category a text belongs to (sports, economy…) or deciding whether a given movie review is positive or negative.
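To see masked-word prediction in action, here is a tiny sketch, again assuming the Hugging Face transformers library; the sentence is my own example:

```python
# A small sketch of masked-word prediction with a pre-trained BERT,
# assuming the Hugging Face "transformers" library; the sentence is illustrative.
from transformers import pipeline

fill_mask = pipeline("fill-mask", model="bert-base-uncased")

# BERT predicts the most likely tokens for the [MASK] position,
# using context from both sides.
for prediction in fill_mask("BERT is fine-tuned on a downstream [MASK]."):
    print(prediction["token_str"], round(prediction["score"], 3))
```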

BERT’s architecture is a multi-layer bidirectional Transformer, and the two original versions differ in the number of Transformer blocks, the hidden size and the number of self-attention heads: BERT-base uses 12 layers, a hidden size of 768 and 12 attention heads, while BERT-large uses 24 layers, a hidden size of 1,024 and 16 attention heads.

The models’ sizes were chosen such that BERT-base matches GPT, but it has an advantage over GPT: its bidirectional self-attention lets it condition on both sides of the context, whereas GPT can only condition on the left side.
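If you want to check those numbers yourself, the published checkpoints expose their configuration; here is a quick sketch, assuming the transformers library:

```python
# A quick sketch comparing the two original BERT configurations,
# assuming the Hugging Face "transformers" library.
from transformers import BertConfig

for name in ("bert-base-uncased", "bert-large-uncased"):
    cfg = BertConfig.from_pretrained(name)
    print(name, cfg.num_hidden_layers, "layers,",
          cfg.hidden_size, "hidden units,",
          cfg.num_attention_heads, "attention heads")
# Expected: base = 12 / 768 / 12, large = 24 / 1024 / 16.
```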

BERT has a fixed vocabulary which cannot be changed. So how does it handle out-of-vocabulary words? One of the older techniques was to map them all to a special “unknown” token, but BERT handles this more gracefully. Take the word “embeddings”: it is not in BERT’s vocabulary as a whole, but BERT’s WordPiece tokenizer breaks unknown words into subwords, i.e. “embeddings” ==> “em”, “##bed”, “##ding”, “##s” (very much like fastText, except that fastText averages all the n-gram embeddings). BERT has a subword for every character, so it can even fall back to individual letters. The first token of every sequence is always a special classification token, denoted [CLS], and a separator token is denoted [SEP].
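Here is a short sketch of that subword behaviour, assuming the transformers tokenizer for bert-base-uncased:

```python
# A short sketch of BERT's WordPiece subword handling,
# assuming the Hugging Face "transformers" library.
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

# An out-of-vocabulary word is split into known subwords.
print(tokenizer.tokenize("embeddings"))
# ['em', '##bed', '##ding', '##s']

# Special tokens are added when encoding a full sequence.
print(tokenizer.decode(tokenizer.encode("hello embeddings")))
# '[CLS] hello embeddings [SEP]'
```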

BERT is a pre-trained model, originally trained on two tasks: next sentence prediction (NSP) and masked language modelling (MLM). There have since been many studies proposing ways to improve BERT’s performance by changing these tasks: some found that removing NSP does not hurt performance, others suggested replacing NSP with predicting both the previous and the next sentence, and further approaches include dynamic masking, beyond-sentence MLM, permutation language modelling, etc.

But let’s focus on the original version.

BERT’s “fake” pre-training tasks:

  1. Masked Language Model – some percentage of the input tokens are masked at random and the model is trained to predict them; in the original paper, 15% of all tokens were masked. This is what makes it possible to fuse the left and right contexts (a simplified sketch of the masking procedure follows after this list).
  2. Next Sentence Prediction – given two sentences A and B, was B found immediately after A in the original text? This task is intuitive enough that I will skip a longer description.
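Here is a simplified sketch of the masking procedure; the 80/10/10 split of the selected tokens comes from the original paper, while the toy function works on string tokens rather than token ids, purely for illustration:

```python
# A simplified sketch of MLM masking: 15% of tokens are selected; of those,
# 80% become [MASK], 10% become a random token and 10% stay unchanged.
# Real implementations work on token ids and whole batches.
import random

def mask_tokens(tokens, vocab, mask_prob=0.15):
    masked, labels = [], []
    for token in tokens:
        if random.random() < mask_prob:
            labels.append(token)          # the model must predict the original token
            r = random.random()
            if r < 0.8:
                masked.append("[MASK]")
            elif r < 0.9:
                masked.append(random.choice(vocab))  # random replacement
            else:
                masked.append(token)      # keep the original token
        else:
            labels.append(None)           # no prediction needed here
            masked.append(token)
    return masked, labels

tokens = "the quick brown fox jumps over the lazy dog".split()
print(mask_tokens(tokens, vocab=tokens))
```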

The original paper additionally mentions that there are two ways of taking advantage of the pre-trained weights. One can add a single fully connected layer with as many neurons as there are classes and a corresponding activation function (single-label or multi-label) and fine-tune the whole network, or one can train a separate classifier on the extracted pre-trained embeddings, which is computationally cheaper. Which approach is better? That depends on the dataset. Remember, all of machine learning (at least its classification part) is concerned with finding the classifier best suited to the probability distribution our data is generated from.
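As a sketch of the second, cheaper approach: take the [CLS] representation from a frozen BERT and train a lightweight classifier on top. The snippet assumes the transformers library, PyTorch and scikit-learn, and the two-sentence “dataset” is purely illustrative:

```python
# A minimal sketch of the feature-extraction approach: a frozen BERT produces
# sentence embeddings, and a separate, cheaper classifier is trained on them.
# Assumes the Hugging Face "transformers" library, PyTorch and scikit-learn.
import torch
from transformers import BertTokenizer, BertModel
from sklearn.linear_model import LogisticRegression

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
bert = BertModel.from_pretrained("bert-base-uncased")
bert.eval()  # BERT's weights stay frozen; only the classifier below is trained

texts = ["I loved this film", "I hated this film"]
labels = [1, 0]

with torch.no_grad():
    batch = tokenizer(texts, padding=True, return_tensors="pt")
    # Use the hidden state of the [CLS] token as a fixed sentence embedding.
    features = bert(**batch).last_hidden_state[:, 0, :].numpy()

classifier = LogisticRegression().fit(features, labels)
print(classifier.predict(features))
```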

How well did all that work? BERT achieved new state-of-the-art results on 11 NLP tasks.

 

BERT in the context of SEO – a summary

BERT has not directly caused any drops in rankings or changes to pages. Its mission is to answer the user’s query in the best way possible, and it does not directly influence a website’s position in the SERPs.

 

BERT for automation and Machine Learning

Using automated solutions, we have a chance to find content gaps, e.g. through advanced analysis of key phrases and keywords from the Search Console API, comparing data from before the BERT update with the most recent data. Drawing the right conclusions from this comparison may point to new directions for developing a website in terms of SEO. Being aware of how Google is developing lets you design actions that stay beneficial in the long run.
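As one possible starting point, here is a rough sketch of pulling query data for two periods (before the BERT rollout and a recent one) from the Search Console API so they can be compared. It assumes the google-api-python-client library and a service-account key with access to the property; the site URL and date ranges are placeholders:

```python
# A rough sketch of pulling query data from the Search Console API for two
# periods (before the BERT rollout vs. recent) so they can be compared.
# Assumes google-api-python-client and a service-account key with access to
# the property; the site URL and date ranges are examples only.
from google.oauth2 import service_account
from googleapiclient.discovery import build

SITE_URL = "https://www.example.com/"  # hypothetical property

credentials = service_account.Credentials.from_service_account_file(
    "service-account.json",
    scopes=["https://www.googleapis.com/auth/webmasters.readonly"],
)
service = build("searchconsole", "v1", credentials=credentials)

def top_queries(start_date, end_date, limit=100):
    response = service.searchanalytics().query(
        siteUrl=SITE_URL,
        body={
            "startDate": start_date,
            "endDate": end_date,
            "dimensions": ["query"],
            "rowLimit": limit,
        },
    ).execute()
    return {row["keys"][0]: row["clicks"] for row in response.get("rows", [])}

before_bert = top_queries("2019-08-01", "2019-10-20")
recent = top_queries("2020-01-01", "2020-03-31")

# Queries that appeared only recently may point to content gaps worth covering.
new_queries = set(recent) - set(before_bert)
print(sorted(new_queries)[:20])
```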