How do automation and machine learning support Online Marketing?

Some time ago, Agnieszka Zawadzka, Head of Research and Analysis at Whites, conducted a webinar on using Machine Learning in online marketing. The subject attracted considerable interest, so we have decided to prepare an article for those who prefer a written version. We invite you to check it out below.

Machine Learning, deep learning, big data, neural networks, Artificial Intelligence… buzzwords that we hear everywhere today. Machine-learning techniques are behind the success of companies such as Netflix, Google, and Spotify. In addition, popular culture, especially science-fiction literature and films, has fostered the belief in the almost supernatural powers of “artificial brains”, or neural networks. So we usually approach Machine Learning with a mixture of dread and delight; we treat it like magic: incomprehensible, hard to reach, very costly, and somewhat scary. Yet it is no more, and no less, than mathematics.

Overall, it is little more than the graph of a function, familiar from high-school mathematics. The difference is that we investigate a multitude of such functions, based on a very large number of coefficients that we keep adjusting. The difficulty lies not in some alchemical formula or in the mathematical sophistication of the models, but above all in the volume of data needed to find the set of functions that best describes the observed phenomenon. A Machine Learning specialist selects the model and determines which type of function set will be taken into account. This can even be the simplest linear function! It can also be polynomials or exponential and trigonometric functions – formulas known for centuries. In the case of neuron-based models, there is no magic either: they are just a few functions applied in a specific order.

Indeed, we could calculate this on a piece of paper… Nay! It could even have been calculated by a mathematician living a hundred or even two hundred years ago. So why has ML become so successful only quite recently? Searching for the coefficients of a model is actually a set of simple calculations – but you need to do a great many of them. Until recently, the hardware was insufficient; more specifically, computers with the right computing power and storage space were simply too expensive to make their business use affordable.

 

So, how does Machine Learning work?

Machine Learning is searching for regularity in the collected data. The simplest example is supervised learning, which is based on known results. We have a set of characteristics describing the observed case, and we know what the actual result was. Let’s say that we want to recognize an animal’s species on the basis of various physical data: color, number of limbs, length of body, presence of wings, etc. So we measure and record the values of the individual features of different animals, and next to them we write down the result – the name of the animal (or rather, the number assigned to it). We take as many such measurements as possible to eliminate errors in pattern detection.

In the simplest mathematical interpretation, let’s look at the linear model: y = ax + b. Because there are many parameters, our x will actually be a vector of features: x = [x1, x2, …, xn], where x1 is the color (expressed numerically), x2 is the length of the body, x3 is the number of legs, etc., while y is the number assigned to the animal. The equation will then look as follows: y = a1 * x1 + a2 * x2 + … + an * xn + b

By substituting the x values and the corresponding y results, we try to adjust the coefficients so that we get the correct y as often as possible. Because we are dealing with supervised learning – that is, learning based on known results – we can see how much our equations differ from the actual classification of the animal. As we try to reduce the discrepancy between the model and reality, we are looking for the minimum of the error, an activity well known from high-school Math classes. The principle is, as we can see, simple. However, to achieve a good prediction, a great many calculations must be made on a large number of measurements.
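To make this concrete, here is a minimal sketch of the idea in Python. The animal measurements below are entirely made up for illustration: each row is a feature vector x (color code, body length, number of legs, wings yes/no), and y is the number assigned to the species. Ordinary least squares finds the coefficients a (and the intercept b) that minimize the error described above.

```python
import numpy as np

# Hypothetical measurements: [color code, body length (cm), legs, wings]
X = np.array([
    [1, 60, 4, 0],   # dog
    [1, 45, 4, 0],   # cat
    [3, 35, 2, 1],   # duck
    [2, 180, 4, 0],  # horse
    [1, 70, 4, 0],   # dog
    [2, 40, 4, 0],   # cat
])
y = np.array([0, 1, 2, 3, 0, 1])  # species number written down by a human

# Add a column of ones so the model can also learn the intercept b
X_b = np.hstack([X, np.ones((len(X), 1))])

# Least squares: find the coefficients that minimize the squared error
coef, *_ = np.linalg.lstsq(X_b, y, rcond=None)

# Predict the species number for a new, unseen animal and round it
new_animal = np.array([1, 65, 4, 0, 1])
print(round(float(new_animal @ coef)))  # hopefully 0, i.e. a dog
```

With only six made-up rows this is of course a toy; the article’s point is precisely that a real model needs many more such measurements to become reliable.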

Particular individuals of a given species may differ significantly in the value of one of the features. For example, looking at a 7-kg Maine Coon and a 2-kg Chihuahua, we can see that on the basis of weight alone we are not able to answer the question of which is a dog and which is a cat. However, the whole set of features shows a pattern, which the Machine Learning program approaches ever more closely through successive substitutions of experimental data.

And it is precisely the existence of such patterns that leads us to another widely used domain of ML: unsupervised learning. Let’s assume that all measurements of the animals’ characteristics are made using automatic sensors, without involving someone who assigns each specimen to a particular species. We know that there were 7 different species in the animal group under examination. We can say that a dog will be “somehow” more similar to other dogs than to horses or ducks. So, without giving specific results for particular individuals, we can divide them into groups according to similarity. However, it is crucial to know how many groups we are dealing with. This assumption must be entered into the model at the outset, and if it is wrong, it will lead to an incorrect division. For example, if we specify too few groups, the model, trying to sort our menagerie, will group reptiles with amphibians or cats with birds.
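A minimal sketch of this grouping with scikit-learn is shown below. The sensor readings are again hypothetical, and the only decision we make up front is the number of clusters – 4 for this toy data; in the article’s menagerie it would be 7.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Hypothetical sensor readings: [color code, body length (cm), legs, wings]
measurements = np.array([
    [1, 60, 4, 0], [1, 70, 4, 0],   # probably dogs
    [1, 45, 4, 0], [2, 40, 4, 0],   # probably cats
    [3, 35, 2, 1], [3, 38, 2, 1],   # probably ducks
    [2, 180, 4, 0],                 # probably a horse
])

# Scale the features so that body length does not dominate the distance measure
scaled = StandardScaler().fit_transform(measurements)

# We must tell the model how many groups to look for
kmeans = KMeans(n_clusters=4, n_init=10, random_state=0)
labels = kmeans.fit_predict(scaled)
print(labels)  # a cluster number for each specimen, without any species names
```

The output is only a group number per animal; attaching the name “dog” or “duck” to each group is still a human decision, which is exactly the difference from supervised learning.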

 

NLP – how does a computer understand natural language?

Programs cannot directly understand words – they work on numbers. The interpretation of language has been dealt with for several decades by an entire branch of Machine Learning: Natural Language Processing (NLP). At first glance, the issue may seem simple: give each word in the dictionary a sequence number and then analyze the text. But what about homonyms and polysemes? For example, “coach” meaning a bus versus “coach” meaning a sports instructor, or “have”, which may be a verb with multiple meanings (“I have two pencils”, “have another drink”) or a noun (“the haves and the have-nots”). Intuition also tells us that certain words matter more or less for understanding a text. We also know that replacing a word with its synonym does not change the meaning of the sentence – and that, too, has to be taught to a model that recognizes natural language. Again, a lot of data will be needed for the model to learn these regularities.

The first language-analysis models were those operating on individual words, starting with TF-IDF, introduced almost fifty years ago (1972, Karen Spärck Jones). This model varied the importance of a word in a given document according to its general popularity: if a word appears often in the entire language corpus, it is less important, because it carries less information specific to a single text. TF-IDF weights did not account for homonyms, synonyms, or word order. That is, the text “I don’t want to dance. I want to read” was equivalent to the text “I don’t want to read. I want to dance”, and as is easy to guess, that causes a lot of problems in correctly understanding what this person wants. Worse still, the sentences “make me a small cup of coffee” and “brew some coffee for me” did not lie close to each other at all, which could lead to a domestic drama!

Language has a sequential structure, and the mathematical models that represent it should reflect that. This was the thinking behind word2vec – a model that assigns numbers to words based on the contexts in which each word most frequently appears. In a training dataset consisting of a huge number of texts (e.g., the National Corpus of Polish, Wikipedia resources, Common Crawl), you can check in which company words usually appear and then give them numerical values so that “dog” lands close to other dogs and far from “tank” (unless we are teaching the model on the basis of the popular Polish war film “Four Tank-men and a Dog”, where the dog Szarik was part of the tank crew). Because each word carries a lot of information, it is not enough to represent it with a single number. Therefore, in the computer-based dictionary learned by word2vec, each word is represented by 300 numbers – in mathematical terms, a 300-dimensional vector. In this way, you can express the relationships between words and solve the problem of synonyms.
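A quick sketch with scikit-learn’s TfidfVectorizer illustrates the word-order problem described above: because TF-IDF only counts words, the two swapped sentences receive exactly the same vector, while the two coffee requests barely overlap.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

texts = [
    "I don't want to dance. I want to read.",
    "I don't want to read. I want to dance.",
    "Make me a small cup of coffee.",
    "Brew some coffee for me.",
]

vectors = TfidfVectorizer().fit_transform(texts)
sims = cosine_similarity(vectors)

print(round(sims[0, 1], 2))  # 1.0 - the swapped sentences look identical to TF-IDF
print(round(sims[2, 3], 2))  # low - the two coffee requests share almost no words
```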

We look at the likelihood of the word “dog” occurring after the words “small” or “fierce”, and at the same time the likelihood of “dog” occurring before the words “loudly” or “barks”. On the basis of such sentences taken from the corpus, we modify the 300 coordinates that define a given word. Words that have similar meanings will end up with similar vectors. These models give the algorithm a basic understanding of the written word. For example, they allow us to teach the model that, given the set of words “smallpox”, “immunity”, and “injection”, the word most closely related to this set is “vaccination”. Thus, the program can catch the relationships between words.
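With a pretrained word2vec model this relationship can be queried directly. The sketch below uses gensim and the publicly available 300-dimensional Google News vectors; the exact neighbours you get depend on the corpus the vectors were trained on, so treat the printed results as illustrative.

```python
import gensim.downloader as api

# Pretrained 300-dimensional word2vec vectors (large download on first run)
vectors = api.load("word2vec-google-news-300")

# Which word lies closest to this set of words in the 300-dimensional space?
print(vectors.most_similar(positive=["smallpox", "immunity", "injection"], topn=3))

# Synonym-like relationships show up as high cosine similarity
print(vectors.similarity("dog", "puppy"))
print(vectors.similarity("dog", "tank"))
```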
However, we are still in the realm of individual words. Depending on which language corpus we have based our model on, the Polish word “kawka” will be close either to a cup of coffee or to a bird (“kawka” in Polish means both a “western jackdaw” and a “small coffee”). No one wants to brew a bird in the coffee machine in the morning. To serve people well, the model should recognize the entire phrase “brew a small coffee”, and that cannot be achieved by rigidly assigning values to individual words.

 

BERT

BERT is a more advanced model that can draw conclusions from entire sentences. It recognizes various contexts and is able to assess whether the word “this” refers to an animal or, for example, a street. It learns what we as people learn intuitively: it assigns a specific weight (attention) to each word depending on its immediate context. This was a breakthrough. With this model, we can use Machine Learning in content and marketing activities.
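The difference from word2vec can be seen by asking BERT for the embedding of the same word in two different sentences: the vector is no longer fixed, it depends on the context. A minimal sketch with the Hugging Face transformers library, using the generic bert-base-uncased checkpoint purely as an example:

```python
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

def word_vector(sentence: str, word: str) -> torch.Tensor:
    """Return the contextual embedding of `word` inside `sentence`."""
    enc = tokenizer(sentence, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**enc).last_hidden_state[0]  # one vector per token
    tokens = tokenizer.convert_ids_to_tokens(enc["input_ids"][0].tolist())
    return hidden[tokens.index(word)]

bus1 = word_vector("the coach to warsaw leaves at noon", "coach")
bus2 = word_vector("the coach was stuck in traffic on the highway", "coach")
trainer = word_vector("the coach praised the team after the match", "coach")

cos = torch.nn.CosineSimilarity(dim=0)
print(cos(bus1, bus2).item())     # expected: relatively high (same "bus" sense)
print(cos(bus1, trainer).item())  # expected: lower (different sense of "coach")
```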

 

What kind of marketing activities can Machine Learning be used for?

The Internet is an ideal environment for Machine Learning, and Internet marketing in particular can really take advantage of its capabilities. We have a huge amount of automatically collected data, most systems offer APIs, and we can run tests at low cost: divide target groups, change the smallest details, and check responses. It is an ideal field for algorithms to learn and improve their performance. In advertising and offline sales, gathering such data would be expensive and sometimes downright impossible (e.g., analyzing the behavior of a brick-and-mortar store’s customers cross-referenced with information about who saw the brand’s promotional billboard in the last week). However, the collision of our culture’s unrealistic (“magic”) expectations of ML with the impression that the status quo seems unchanged (“Bah! This is that AI? It can’t even make me coffee!”) often discourages people from taking advantage of what the technology can already deliver. Although ML is not Skynet, it can still answer many questions that marketers ask every day. We’ll divide these questions into three areas:

  1. SEO
    How can we examine traffic and estimate the results?
    How can we automate labor-intensive optimization activities?
    What makes the website get a high rank in search engines?
  2. Content Marketing
    Is the text correct and properly structured? What is its tone?
    Is it free of duplication? (a special challenge on portals with hundreds or thousands of articles)
    What should we write about?
  3. Growth Marketing
    How can we best create an advertisement? What words and graphics should be used?
    How do we engage users? E.g., what calls to action should we give them to keep them longer?
    How can we improve conversion?
    How should we apportion advertising budgets?

Today we handle most of these issues intuitively – but that requires expertise and years of experience. Machine Learning can give us reliable information that will help us automate these activities. How?

 

SEO: How can we automate operations and optimize a website?

I. Traffic analysis
The SEO specialist often faces the challenge of analyzing an overwhelming amount of data. In practice, there is often too much of it to analyze effectively by hand. Without any major difficulty, we can use ML to assess the following:

  • Has anything changed or are fluctuations normal?
  • Is the traffic real or generated by bots?
  • Estimates – what will happen in the near future? (see the forecasting sketch after this list)
  • User patterns/areas to improve.
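A minimal sketch of the “is this fluctuation normal, and what happens next?” questions, assuming a daily traffic export (for example from Google Analytics) with a date column and a sessions column; the file name and columns are placeholders. It uses the Prophet library for forecasting, but any time-series model would do.

```python
import pandas as pd
from prophet import Prophet

# Assumed CSV export with columns: date, sessions
traffic = pd.read_csv("organic_traffic.csv", parse_dates=["date"])
df = traffic.rename(columns={"date": "ds", "sessions": "y"})

model = Prophet(weekly_seasonality=True, yearly_seasonality=True)
model.fit(df)

# Forecast the next 30 days
future = model.make_future_dataframe(periods=30)
forecast = model.predict(future)

# Flag past days that fall outside the model's uncertainty interval:
# these are the changes worth investigating, the rest is normal fluctuation
merged = df.merge(forecast[["ds", "yhat", "yhat_lower", "yhat_upper"]], on="ds")
anomalies = merged[(merged["y"] < merged["yhat_lower"]) | (merged["y"] > merged["yhat_upper"])]
print(anomalies[["ds", "y", "yhat"]])
```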

 

All the actions that rely on copy-paste can and should be automated. How much time do we sometimes spend just on creating reports? We can automate:

  • data ingestion via API,
  • combining data from different sources,
  • generating automatic reports,
  • “scraping” GA,
  • analysis of keyword cannibalization based on GSC data (see the sketch after this list),
  • optimization/content gap analysis.
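A sketch of the cannibalization check, assuming a Search Console performance export with columns query, page, clicks, impressions, and position (the file name and the impression threshold are placeholders): queries for which several different URLs collect meaningful impressions are candidates for consolidation.

```python
import pandas as pd

# Assumed Search Console export: query, page, clicks, impressions, position
gsc = pd.read_csv("gsc_performance.csv")

# Keep only pages with non-trivial visibility for each query
visible = gsc[gsc["impressions"] >= 50]

# Count how many distinct URLs compete for each query
competition = (
    visible.groupby("query")
    .agg(pages=("page", "nunique"), clicks=("clicks", "sum"))
    .reset_index()
)

# Queries served by more than one URL are cannibalization candidates
cannibalized = competition[competition["pages"] > 1].sort_values("clicks", ascending=False)
print(cannibalized.head(20))
```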

 

These are only examples of actions that can be optimized. It’s worth considering what the machine could do for us in our daily work.

II. “Website cleaning”
Based on what we already know about grouping data (clustering) and assigning it to known categories (categorization based on known results), we can use NLP-based numerical representations of words and sentences to say how close two articles are to each other. Therefore, instead of carefully analyzing hundreds of thousands of articles, we can automatically resolve issues such as:

  • categorization,
  • duplicates and near-duplicates (see the sketch after this list),
  • internal linking.
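A sketch of near-duplicate detection with sentence embeddings. It uses the sentence-transformers library with a generic checkpoint (all-MiniLM-L6-v2, chosen only as an example); article pairs whose embeddings have very high cosine similarity are flagged for review, and the threshold has to be tuned on your own content.

```python
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity

articles = {
    "article_1": "How to brew the perfect small coffee at home ...",
    "article_2": "A guide to making great coffee with a home coffee machine ...",
    "article_3": "Training tips for your dog's first agility class ...",
}

model = SentenceTransformer("all-MiniLM-L6-v2")  # example checkpoint
ids = list(articles)
embeddings = model.encode([articles[i] for i in ids])

# Compare every pair of articles and flag suspiciously similar ones
sims = cosine_similarity(embeddings)
for a in range(len(ids)):
    for b in range(a + 1, len(ids)):
        if sims[a, b] > 0.85:  # threshold to tune on real content
            print(f"possible near-duplicate: {ids[a]} vs {ids[b]} ({sims[a, b]:.2f})")
```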

III. Guess whether you’ll be in the TOP with this text
Based on the available data, we try to find the factors that really determine which websites rank in the top positions on Google. To do this, we select a group of closely related phrases, collect the results from the top positions, analyze the rankings, and try to predict whether a given page has a chance of reaching the top. This is difficult to do perfectly, but under certain conditions, if the phrases are close enough, we can get quite good results. At Whites, we are working on this area and improving our tools to assess texts as accurately as possible.
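As an illustration of the general approach (not of Whites’ internal tooling), here is a hedged sketch. It assumes we have already built a hypothetical feature table for pages ranking for a group of related phrases, with a label saying whether each page made the top 10; every column name and value below is a placeholder. A gradient-boosting classifier then estimates the probability that a new text will rank.

```python
import pandas as pd
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

# Hypothetical feature table: one row per (page, phrase) pair
data = pd.read_csv("serp_features.csv")

X = data[["word_count", "headings", "keyword_density", "similarity_to_top10"]]
y = data["in_top10"]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

model = GradientBoostingClassifier()
model.fit(X_train, y_train)
print("hold-out accuracy:", model.score(X_test, y_test))

# Estimated probability that a freshly written text (with these features) ranks
new_text = pd.DataFrame([{"word_count": 1800, "headings": 9,
                          "keyword_density": 0.015, "similarity_to_top10": 0.72}])
print("estimated chance of the TOP 10:", model.predict_proba(new_text)[0, 1])
```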

 

CONTENT: How can we best improve and evaluate editorial activities?

Machine Learning can be used for the following:

  • finding content gaps (what to write about),
  • assessment of the formal correctness of a text,
  • examination of duplication/close duplication,
  • verification of topicality,
  • sentiment analysis (see the sketch after this list),
  • picture selection.
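Of these, sentiment analysis is the easiest to try out today, since ready-made models are available off the shelf. A sketch with the Hugging Face pipeline API; the default English sentiment model is used here purely as an example, and production use on Polish content would need a model trained for that language.

```python
from transformers import pipeline

# Off-the-shelf sentiment classifier (downloads a default English model)
sentiment = pipeline("sentiment-analysis")

reviews = [
    "The new espresso machine brews a perfect small coffee every morning.",
    "The article is a near-duplicate and adds nothing new to the site.",
]
for review, result in zip(reviews, sentiment(reviews)):
    print(result["label"], round(result["score"], 2), "-", review)
```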

A more complex issue is editorial revision. Using different models, we can examine it on several levels:

  • formal construction,
  • editorial correctness,
  • topical compatibility (this area requires semantic analysis – the algorithm must “understand” the text, or rather position it appropriately in space, i.e., assign it the appropriate coordinates),
  • literary value (this requires behavioral analysis – the most complex issue; the algorithm must be told how people rate the text: do they consider it literarily valuable or not? The issue is not mathematically complex or particularly difficult to compute; the problem lies primarily in obtaining a training dataset, i.e., many assessments for many texts).

 

Growth: How can we optimize conversion?

Machine Learning is useful in the following areas:

  • Ad optimization,
  • optimization of settings (API-driven automation),
  • optimization of the website for conversion,
  • user analysis, target group clustering,
  • recommendation systems,
  • fraud detection,
  • budget planning.

Many brands are already using Machine Learning tools. Just look at:

  • Ad optimization (Google, Facebook),
  • recommendation systems (different generators),
  • behavioral analysis systems,
  • marketing automation.

However, they aren’t making full use of the potential; there are still gaps that we can fill with our own tools, e.g.:

  • automatic quality analysis of campaign management,
  • analysis of ads/affiliate abuse.

A more complex subject – but also an extremely promising one – is conversion attribution and optimal budget distribution.
When we plan a budget, we rely on our own assumptions: we draw up conversion paths, but we have no confirmation in the data of which areas really generate the most profit. What if some channel (SEO, Content Marketing, Social Media) has less impact on the results than we think? Perhaps a slight remodeling, allocating expenses to a different area, would bring us better results? Or maybe it is the other way around and we are overpaying for a channel?

We face a number of challenges in conversion attribution and optimal budget distribution:

  • Facebook and Google offer closed ecosystems that do not see each other’s data,
  • absence of offline data in analytics,
  • absence of cost data on non-paid channels in analytics (SEO, social, direct),
  • cross-sectional data from different industries is needed to complete the funnel.

So how can we approach this topic? Let’s look at spending in terms of the best return over a given period of time (a simplified sketch of such an allocation follows below).

With this data-based approach, we will be able to determine how much of the marketing budget should be allocated to each channel.
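A heavily simplified sketch of this idea: assume we have fitted a diminishing-returns curve for each channel from historical spend and conversion data (the channel names and curve parameters below are made-up placeholders). We then ask an optimizer how to split a fixed budget so that the total modelled return is highest.

```python
import numpy as np
from scipy.optimize import minimize

# Made-up diminishing-returns curves: return(spend) = a * log(1 + spend / b)
# In practice a and b would be fitted per channel from historical data.
channels = ["seo", "content", "paid_social", "paid_search"]
a = np.array([120.0, 80.0, 60.0, 150.0])
b = np.array([5000.0, 3000.0, 2000.0, 8000.0])

total_budget = 40000.0

def negative_return(spend):
    # Minimizing the negative return maximizes the modelled total return
    return -np.sum(a * np.log1p(spend / b))

result = minimize(
    negative_return,
    x0=np.full(len(channels), total_budget / len(channels)),  # start from an even split
    bounds=[(0, total_budget)] * len(channels),
    constraints=[{"type": "eq", "fun": lambda s: s.sum() - total_budget}],
)

for name, spend in zip(channels, result.x):
    print(f"{name}: {spend:,.0f}")
```

The hard part, as the list above suggests, is not the optimization itself but assembling trustworthy response curves from data scattered across closed ecosystems.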

 

Summary

Machine Learning can support marketing professionals in all areas where datasets are too large to be analyzed effectively by people. It is also worth considering automation wherever repetitive activities do not require creative thinking. This allows employees to focus to a greater extent on tasks that require creative, strategic, or deep analytical thinking.

First of all, let us not ignore the advantages available even today: time savings, and less frustration for the people who will no longer have to deal with tedious and unambitious tasks. Machine Learning models are not perfect – it is not magic, just a sizable number of approximate calculations, a search for patterns that “tolerably match” our complex reality. On a daily basis, however, such a match is enough to make a sensible marketing decision that is genuinely data-based.