Machine Learning in SEO – the key to great website content in 2020
The three types of SEO
Search Engine Optimization (SEO) can be divided into three broad categories: technical, UX, and content.
- Technical SEO focuses on optimizing crawling and indexing. It covers aspects such as website speed, dead link identification, redirect loops, content duplication, security and structured data.
- UX-based SEO builds on the concept that readability (font size, color scheme, Automated Readability Index, etc.), text density, internal linking, topicality or the number of images influence search rankings.
- Content SEO covers any content created with the goal of attracting search engine traffic.
From the machine learning point of view, SEO content is the most interesting category in the era of accelerated advances in Natural Language Processing (NLP). NLP can assist in content creation, content organization, or content discovery. Additionally, from a global perspective, machine learning can be used in reverse engineering studies that help prioritize the website optimization strategy and point towards the ranking factors that really matter for a specific category.
The role of Machine Learning
Machine Learning is an important area of AI. Its goal is to learn from data, without the algorithm being explicitly programmed, and to make predictions based on the data received. Machine Learning is not only a tool for optimization, where complex patterns present in the data are exploited to improve existing processes (e.g. credit scoring, predictive maintenance, logistics). It can also be used to automate human tasks and centralize the decision-making process. In this context, Machine Learning provides scalability and control that are hard to achieve otherwise (it is easier to de-bias a model than an army of people). The value it provides, however, depends on the data and on a clear definition of the business problem to be solved. Below we compile a number of examples, focusing strongly on textual content, in which Natural Language Processing techniques may serve as both optimization and automation strategies.
Content creation and discovery
Google says that the key to success is getting your content in front of the right people. This means that understanding what they want and how they search for it is a crucial step, and it can be automated with Machine Learning. The most relevant search results are the ones that match the user’s search intent. If you feel that your content is well written but it does not rank high, the most likely problem is a mismatch between what users want and what your content provides. In general, users may look for information, want to perform an action, or want to make a transaction. The intent of a particular query can be discovered by analyzing search results.
Although this task can be performed by a human, in that form it is not scalable. Machine Learning can be utilized to build a model that is capable of determining the intent based on a query. Strategically, the intent classification system can be used as a diagnostic tool, which analyzes phrases from Google Search Console (GSC) and identifies which websites need a change due to intent mismatch. In general, it can also serve as a ranking signal in the reverse engineering approach.
How to classify query intent with the BERT model?
Fig. 1 shows how a deep neural network classifies query intent and what confidence (probability) it assigns to each class (own studies). We fine-tuned the BERT model on a 6-class classification task and measured its performance with the F1 score. Three different scenarios are presented in the figure.
Fig. 1: Intent classification probabilities using the BERT model. Orange bars mark the correct class. We used BERT with 110 million parameters and fine-tuned it on a 6-class classification task using ~5000 labeled examples. It was trained for ~2000 iterations on an NVIDIA Tesla P100. In the top figure, BERT classifies a query of the HUMAN class correctly and with very high confidence. In the middle figure, the model is less certain about the correct class (0.57). In the bottom figure, it makes a wrong prediction about the intent, but its second guess would still be correct. Overall, we got F1 = 0.9 on the test set.
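The confidence values discussed for Fig. 1 can be read as class probabilities. A classifier of this kind typically produces one raw score (logit) per class, which a softmax turns into probabilities; ranking them also gives the "second guess" mentioned in the caption. The sketch below is only an illustration of that step, not our training code: apart from HUMAN, the label names and all the logit values are hypothetical.

```python
import math

def softmax(logits):
    """Convert raw classifier scores into probabilities that sum to 1."""
    peak = max(logits)  # subtracting the maximum keeps exp() numerically stable
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def top_k(labels, probs, k=2):
    """Return the k most probable (label, probability) pairs, best first."""
    ranked = sorted(zip(labels, probs), key=lambda pair: pair[1], reverse=True)
    return ranked[:k]

# Hypothetical label set and logits; only HUMAN is named in the article.
LABELS = ["HUMAN", "LOCATION", "ENTITY", "DESCRIPTION", "NUMERIC", "ABBREVIATION"]
logits = [1.2, 2.0, 0.1, 1.9, -0.5, -1.0]

probs = softmax(logits)
for label, prob in top_k(LABELS, probs, k=2):
    print(f"{label}: {prob:.2f}")
```

When the top two probabilities are close, as in this made-up case, the query is a good candidate for manual review rather than an automatic decision.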
Valuable content is planned in advance. A skeleton of concepts and the relations between them is then laid out in the form of written words. Building a complete informative service with thousands of articles is a difficult and time-consuming process that may take months. Machine Learning can help here in a scalable way. Well-written content is complete and typically comes in the form of a graph of entities, very similar to the knowledge graphs used by Google or other search engines. For example, if you write about a particular disease, you should also include related information such as treatment, symptoms, and cause. Triples of this kind, which take the form subject-predicate-object, can be mined directly from online texts [1,2], stored in a graph database and used in content planning tasks. At a less advanced level, automatic keyword extraction techniques and word embeddings, such as Word2Vec, may be used to generate ideas about the content. From a business perspective, complete and coherent content increases users’ trust, which in turn has a positive effect on CTR, behavioral factors, brand awareness or return rate.
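To make the idea of subject-predicate-object triples in content planning more concrete, here is a minimal sketch that uses an in-memory dictionary in place of a graph database. The triples, the predicate names and the `REQUIRED` checklist are all hypothetical and serve only as an illustration.

```python
from collections import defaultdict

# Hypothetical triples; in practice they would be mined from text
# and stored in a graph database.
TRIPLES = [
    ("influenza", "has_symptom", "fever"),
    ("influenza", "has_treatment", "rest and fluids"),
    ("influenza", "has_cause", "influenza virus"),
    ("migraine", "has_symptom", "headache"),
]

def build_graph(triples):
    """Index triples as subject -> predicate -> set of objects."""
    graph = defaultdict(lambda: defaultdict(set))
    for subject, predicate, obj in triples:
        graph[subject][predicate].add(obj)
    return graph

def missing_predicates(graph, subject, required):
    """List the required predicates that a subject does not cover yet."""
    covered = graph.get(subject, {})
    return [predicate for predicate in required if predicate not in covered]

# Hypothetical checklist of relations a complete disease article should cover.
REQUIRED = ["has_symptom", "has_treatment", "has_cause"]

graph = build_graph(TRIPLES)
print(missing_predicates(graph, "influenza", REQUIRED))  # []
print(missing_predicates(graph, "migraine", REQUIRED))   # ['has_treatment', 'has_cause']
```

A non-empty result points at gaps in the content plan: in this toy data, the migraine topic still lacks treatment and cause.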
Concept graph extracted from Word2Vec
In Fig. 2 we show a sample concept graph associated with the phrase “natural language” from our model. When it comes to keyword extraction, simple unsupervised algorithms such as TextRank can do the job. They can be augmented with custom preprocessing steps such as collocation learning, coreference resolution or named entity recognition to better recognize key phrases.
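As an illustration of how TextRank-style keyword extraction works, the sketch below links words that occur close to each other and ranks them with a PageRank-like iteration. It is a simplified variant: the original algorithm selects candidate words by part of speech, whereas this sketch assumes a small hand-made stopword list, and the window size, damping factor and sample text are illustrative choices.

```python
import re
from collections import defaultdict

# Assumption: a tiny stopword list stands in for part-of-speech filtering.
STOPWORDS = {"a", "an", "and", "the", "of", "in", "to", "for", "is", "are",
             "on", "with", "by", "that", "can", "be", "it", "as"}

def tokenize(text):
    """Lowercase the text and keep alphabetic words that are not stopwords."""
    words = re.findall(r"[a-z]+", text.lower())
    return [word for word in words if word not in STOPWORDS]

def cooccurrence_graph(words, window=2):
    """Undirected graph linking each word to the next `window` words."""
    graph = defaultdict(set)
    for i, word in enumerate(words):
        for other in words[i + 1 : i + 1 + window]:
            if other != word:
                graph[word].add(other)
                graph[other].add(word)
    return graph

def textrank(graph, damping=0.85, iterations=50):
    """Iterate the PageRank update on the co-occurrence graph."""
    scores = {node: 1.0 for node in graph}
    for _ in range(iterations):
        updated = {}
        for node in graph:
            rank = sum(scores[nb] / len(graph[nb]) for nb in graph[node])
            updated[node] = (1 - damping) + damping * rank
        scores = updated
    return scores

def keywords(text, top_n=5):
    """Return the top_n highest-scoring words of a text."""
    scores = textrank(cooccurrence_graph(tokenize(text)))
    return sorted(scores, key=scores.get, reverse=True)[:top_n]

sample = ("Search engines rank content. Good content matches search intent. "
          "Content planning improves search rankings.")
print(keywords(sample, top_n=3))
```

Words that co-occur with many different neighbours collect the highest scores, which is why frequently connected terms surface as keywords.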
Fig. 2: Concept graph produced using word vector similarity extracted from Word2Vec [10,11] with a skip-gram model. In Word2Vec every word has two different embeddings associated with it. Blue ones are semantically related to “natural language” and the orange ones are the highest scored context words.
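The two kinds of neighbours in Fig. 2 can be sketched as follows, under the assumption that semantically related words are found by cosine similarity between word (input) embeddings, while context words are scored by the dot product of a word embedding with context (output) embeddings, as in the skip-gram objective. The 3-dimensional vectors below are hypothetical toy values, not taken from our model.

```python
import math

# Hypothetical toy embeddings; real Word2Vec vectors are learned from a corpus
# and usually have a few hundred dimensions.
INPUT_VECTORS = {
    "language":   [0.9, 0.1, 0.0],
    "linguistic": [0.8, 0.2, 0.1],
    "processing": [0.1, 0.9, 0.2],
    "banana":     [0.0, 0.1, 0.9],
}
OUTPUT_VECTORS = {
    "language":   [0.1, 0.3, 0.0],
    "linguistic": [0.2, 0.1, 0.0],
    "processing": [0.9, 0.2, 0.1],
    "banana":     [0.0, 0.1, 0.8],
}

def dot(a, b):
    """Dot product of two equally long vectors."""
    return sum(x * y for x, y in zip(a, b))

def cosine(a, b):
    """Cosine similarity of two non-zero vectors."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))

def most_similar(word, top_n=2):
    """Words whose input embedding points in a similar direction."""
    query = INPUT_VECTORS[word]
    scored = [(other, cosine(query, vec))
              for other, vec in INPUT_VECTORS.items() if other != word]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:top_n]

def top_context(word, top_n=2):
    """Words whose output embedding scores highest against the input embedding."""
    query = INPUT_VECTORS[word]
    scored = [(other, dot(query, vec)) for other, vec in OUTPUT_VECTORS.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:top_n]

print(most_similar("language"))
print(top_context("language"))
```

With these toy values, "linguistic" is the closest word by similarity, while "processing" is the highest-scored context word, which mirrors the difference between the two colors in the figure.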
Articles that have already been written may be boosted as well. Automatic content enrichment may come in the form of title, description, lead, summary or headline generation. These elements are not generated from scratch but are conditioned on an already existing text. Top-ranking websites may be used as a training set in order to steer Machine Learning models towards output that brings more traffic to the target website. A much more advanced approach, neural text generation, may help you automate repetitive tasks, diversify content in SERPs and increase ranking position. Short and structured pieces of content, such as match summaries, weather reports or product descriptions, can be generated automatically. Product descriptions can also benefit from mining user reviews and comments, which reveals the aspects that matter to buyers. Moreover, image captions or headlines can be generated based on the same idea.
Text generated by a deep learning language model
In Fig. 3 we show how deep learning language models can be fine-tuned for a text generation task.
Fig. 3: Samples of text generated (coupled with human judgment) with OpenAI’s GPT-2 model. We used a medium-sized GPT-2 model with 345 million parameters and fine-tuned it on ~160 MB of online articles related to machine learning and SEO. It was trained for 4000 iterations with the learning rate annealed linearly to 0.
In our experience, domains benefit from better content organization. The organization may come in various forms, such as document tagging, categorization or linking. All of them make the service easier to navigate and explore, and thus improve user experience and indirectly influence the ranking position. Moreover, a properly organized service increases conversion rates (this is why recommendation systems are so popular) or generates income from page views (ads). Although these techniques are similar in design, they differ in subtle ways. Categories are usually hierarchical and segment the space of documents.
Tagging, on the other hand, is ambiguous since a document can have many different tags that summarize different aspects of the content. Thus, tagging gives a more flexible way of organizing the content, and it proves to be useful in News or Technology.
Linking is a form of referral that expresses relatedness between Document A and Document B. It may be due to text similarity, the presence of named entities or other factors. From the modeling point of view, what distinguishes linking from other similarity-based algorithms is its directed nature and lack of symmetry: when Document A links to Document B, the reverse is not necessarily true. This effectively makes it a sequence problem or a binary classification problem with position encoding.
Fig. 4: Concept of hierarchical clustering applied to document segmentation and organization. The decision about the number of clusters can be made by a specialist afterward (3 in the figure).
Identifying similar documents and organizing them may be tackled with unsupervised methods such as clustering. In our experience, hierarchical clustering is adopted more readily, since the number of clusters does not need to be provided upfront, as it does in many popular clustering algorithms. Instead, it can be left for a specialist to decide on. Clustering quality depends crucially on the document representation and on the definition of pairwise distance. In most cases, distance is based on the cosine of the angle between word-frequency or TF-IDF vectors. This approach has two limitations. Firstly, counting statistics treat every word in the same way, but in practice some words or phrases may be more important than others. This may happen in News, Technology or Health, where named entities (such as people, organizations or disease names) play a key role in the content even though they might appear just once in a piece of content. Secondly, standard cosine distance completely ignores the position of words in the text, and their first mention in particular. A proper similarity measure may also emerge as an indirect product of self-supervised pretraining augmented with supervised learning. A notable example of this approach is Google’s Universal Sentence Encoder.
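The baseline described above (TF-IDF vectors, cosine distance and hierarchical clustering with the number of clusters chosen afterwards) can be sketched in plain Python as follows. This is a toy illustration rather than a production setup: the smoothed IDF formula and average linkage are assumed choices, and the four tokenized documents are made up and assumed to be non-empty.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Turn tokenized documents into sparse TF-IDF dictionaries."""
    n_docs = len(docs)
    doc_freq = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        counts = Counter(doc)
        vectors.append({
            # Assumption: smoothed IDF, one of several common variants.
            term: (count / len(doc)) * (math.log((1 + n_docs) / (1 + doc_freq[term])) + 1)
            for term, count in counts.items()
        })
    return vectors

def cosine_distance(a, b):
    """1 - cosine similarity between two sparse vectors."""
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return 1.0 - dot / (norm_a * norm_b)

def agglomerate(vectors):
    """Average-linkage clustering; returns the full merge history."""
    clusters = [[i] for i in range(len(vectors))]
    merges = []
    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                pairs = [(a, b) for a in clusters[i] for b in clusters[j]]
                dist = sum(cosine_distance(vectors[a], vectors[b])
                           for a, b in pairs) / len(pairs)
                if best is None or dist < best[0]:
                    best = (dist, i, j)
        dist, i, j = best
        left, right = clusters[i], clusters[j]
        merges.append((left, right, dist))
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(left + right)
    return merges

def cut(n_docs, merges, n_clusters):
    """Replay the merge history until only n_clusters clusters remain."""
    clusters = [[i] for i in range(n_docs)]
    for left, right, _ in merges[: n_docs - n_clusters]:
        clusters.remove(left)
        clusters.remove(right)
        clusters.append(left + right)
    return sorted(sorted(cluster) for cluster in clusters)

# Hypothetical, already tokenized documents.
docs = [
    "flu symptoms fever cough".split(),
    "flu treatment rest fluids".split(),
    "bert model intent classification".split(),
    "bert model text generation".split(),
]
vectors = tfidf_vectors(docs)
merges = agglomerate(vectors)
print(cut(len(docs), merges, 2))  # [[0, 1], [2, 3]]
```

Because the merge history is computed once, a specialist can call `cut` with different values of `n_clusters` without re-running the clustering.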
Briefly on reverse engineering
Understanding how search engines rank pages based on the query is a hard and complex problem. Basically, a collection of ranking signals is blended together and a decision about the search result order is made by an algorithm – with the aim of meeting the information need expressed by a user. Knowing which ranking signals are at stake is one issue, and discovering how these factors influence each other and which ones are the most important is another one. This is exactly where Machine Learning can help. A model that was trained on existing SERPs can be initially used to extract information about the feature importance. In the second stage, it could be directly exploited to recommend changes that maximize the ranking position. This stage itself creates another challenge since most popular models are discriminative and changes suggested by them might not be physically possible. More on that topic in the next article.
Search engines are continuously evolving, and so is the field of Machine Learning. In particular, research on Natural Language Processing has accelerated in recent years. We described how SEO can benefit from Machine Learning, concentrating heavily on content. Some clients want the whole service to be built from the ground up and some need changes to an existing platform. In both cases, Machine Learning can help in content discovery, planning or organization. It can also serve as a diagnostic tool.
For more on this topic we recommend that you read:
Evgeniy Gabrilovich and Nicolas Usunier, Constructing and Mining Web-scale Knowledge Graphs, 2016, http://www.cs.technion.ac.il/~gabr/publications/papers/SIGIR-2016-KG-tutorial.pdf.
Xiang Ren, et al., Building Structured Databases of Factual Knowledge from Massive Text Corpora, 2017, http://xren7.web.engr.illinois.edu/sigmod17-StructNet.pdf.
Slava Novgorodov, et al., Generating Product Descriptions from User Reviews, 2019.
Llorenc Escoter, et al., Grouping business news stories based on the salience of named entities, 2017.
Daniel Cer, et al., Universal Sentence Encoder, 2018.
Alec Radford, et al., Language Models are unsupervised multitask learners, 2019.
Jacob Devlin, et al., BERT: Pre-training of deep bidirectional transformers for language understanding, 2018.
Rada Mihalcea and Paul Tarau, TextRank: Bringing order into texts, 2004.
Tomas Mikolov, et al., Distributed representations of words and phrases and their compositionality, 2013.
Tomas Mikolov, et al., Efficient estimation of word representations in vector space, 2013.