How Does Machine Learning Help to Automate Text Classification?

Author: Mateusz Lewandowski, Data Scientist

The past decade has been an era of data scientists and rapid progress in machine learning. It is hard to identify and enumerate all the applications of these technologies because there are so many, and the fast-paced research that is constantly underway makes them even harder to keep track of. Even if a task was not supported by a machine learning algorithm until very recently, that doesn’t mean it isn’t now. How does machine learning outperform people in text analysis? How much data do we need to train deep learning models? Let’s explore those areas together!

A while back, humankind learned to store information in a written format. We can confidently say that it was at this point that people began their efforts to classify texts more and more quickly. How come, you might ask? 

Let’s consider one of the most famous texts, the Ten Commandments. According to the Bible, Moses was given these 10 rules for people to follow. 

Anyone who wants to use these instructions to guide their actions faces a difficult task, because while the commandments are written in plain language, very few activities are listed by name. How, then, can you determine whether another, unlisted activity is in fact forbidden?

Humans know how to do that, and we are pretty good at it. Only very recently have machines become good at these tasks too. This is called text classification.

How does machine learning text classification work?

Machines can only deal with numbers, so to process texts they need a way to represent them in numerical form. That is where linear algebra comes in handy. Linear algebra is a branch of mathematics that tells a story about vectors. It turns out that vectors can have any number of dimensions; we can’t picture more than three, but the mathematics works just the same.

Vectors might be very similar to each other, they might point in opposite directions, or they might not be related at all. A few simple mathematical operations, such as the dot product and cosine similarity, let us quickly determine the relationship between two vectors. And that is precisely what can be said about words, for they can be alike (synonyms), opposite (antonyms), or not related at all.
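
As a minimal sketch of that idea, the snippet below computes the cosine similarity between two small, made-up word vectors; the numbers are illustrative only, and real embeddings have hundreds of dimensions.

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine of the angle between two vectors: 1 = same direction, -1 = opposite."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy 3-dimensional "embeddings" (illustrative values, not from a real model).
king = np.array([0.9, 0.7, 0.1])
queen = np.array([0.85, 0.75, 0.2])
banana = np.array([0.1, 0.05, 0.9])

print(cosine_similarity(king, queen))   # close to 1: related words
print(cosine_similarity(king, banana))  # much lower: unrelated words
```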

Once words are converted to vectors, they can be analysed by machines. The question is how to convert them. There are a couple of popular ways to do this and a couple of tasks to measure whether the attempt was effective. One of the earliest attempts found in the literature is called the Cloze task, popular in language teaching: some portions of the text are masked and the model (or student) is asked to fill in the gaps. This method has been used in many very successful attempts to find vector embeddings for words. A vector embedding is nothing more than the vector representation of a given word. Some of the popular ways of converting words to vectors are the following (a short sketch of how to load and query such embeddings comes after the list):

  1. word2vec
  2. ELMo
  3. Flair
  4. GloVe
  5. BERT
  6. many transformer-based models
  7. USE (Universal Sentence Encoder)
  8. LASER
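
As promised above, here is a minimal sketch of loading pretrained word vectors and querying them. It uses the gensim library and a small GloVe model as a convenient, assumed choice; any of the methods listed would do.

```python
import gensim.downloader as api

# Downloads a small set of pretrained GloVe vectors (tens of MB) on first use.
glove = api.load("glove-wiki-gigaword-50")

print(glove["king"][:5])                   # first 5 of the 50 dimensions
print(glove.most_similar("king", topn=3))  # nearest neighbours in embedding space
print(glove.similarity("good", "great"))   # cosine similarity between two words
```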

So far, we have talked about words, but we promised to consider texts. Obviously, a text is composed of words; we have already stated that we can compare vectors, and going from words to vectors is straightforward. It should therefore be easy to compare texts of the same length, but surely not all texts are of the same length, so how can we compare matrices of different sizes? To compare and classify texts, they first need to be converted into word embeddings. The second step is applying a model that can differentiate between them.
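
One common and very simple way to get a fixed-size representation for a text of any length is to average its word vectors (mean pooling). It is only one option among many (sentence encoders such as USE or LASER are others), but it illustrates the idea. The sketch below loads the same small GloVe model as before; the example sentences are made up.

```python
import numpy as np
import gensim.downloader as api

glove = api.load("glove-wiki-gigaword-50")  # same small GloVe model as above

def document_embedding(text: str) -> np.ndarray:
    """Average the vectors of all known words into one fixed-size vector."""
    tokens = [t for t in text.lower().split() if t in glove]
    return np.mean([glove[t] for t in tokens], axis=0)

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

doc_a = document_embedding("the movie was wonderful and moving")
doc_b = document_embedding("a truly great and touching film")
doc_c = document_embedding("the invoice is due next thursday")

# Texts of different lengths now live in the same 50-dimensional space.
print(cosine(doc_a, doc_b))  # relatively high: similar topics
print(cosine(doc_a, doc_c))  # lower: unrelated topics
```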


What Models Are Applied to Classify Text Embeddings?

There are many models. One of the simplest examples would be a function y = ax + b, which we could use to classify two-dimensional vectors as lying below or above the line it defines. For illustration purposes, you can imagine classifying tweets as either positive or neutral this way. But that tiny example is not going to be of any use in real applications.
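
To make the toy example concrete, here is a few-line sketch of such a linear decision boundary; the coefficients and points are made up purely for illustration.

```python
# Classify a 2-D point as "positive" if it lies above the line y = a*x + b.
def classify(point, a=1.0, b=0.0):
    x, y = point
    return "positive" if y > a * x + b else "neutral"

print(classify((0.2, 0.9)))  # above the line -> "positive"
print(classify((0.5, 0.1)))  # below the line -> "neutral"
```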

Some of the models that are popularly applied to classify text embeddings are:

  • Logistic Regression 
  • Naive Bayes
  • Support Vector Machines 
  • Artificial Neural Networks 

The first three are usually considered baseline models, with the simplest baseline (worth building in the first phases of proof-of-concept development) being term frequency-inverse document frequency (TF-IDF) features with a logistic regression applied on top of them for classification. With that in place, one should expect BERT-based models to beat the baseline result by a wide margin. And they should (unless the modelled data is a mess).
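
A minimal sketch of that baseline with scikit-learn, assuming a tiny illustrative dataset of labelled sentences (a real project would use thousands of examples), might look like this:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Tiny illustrative dataset, made up for the example.
texts = [
    "I love this product, it works great",
    "Absolutely fantastic experience, highly recommend",
    "This is the worst purchase I have ever made",
    "Terrible quality, it broke after one day",
]
labels = ["positive", "positive", "negative", "negative"]

baseline = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), LogisticRegression())
baseline.fit(texts, labels)

print(baseline.predict(["it broke immediately, terrible"]))  # expected: negative
print(baseline.predict(["great quality, I recommend it"]))   # expected: positive
```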

Datasets for Deep Learning Models

In this section you will learn:

  • How big a dataset should be.
  • How to create it.
  • How expensive it is.
  • What conditions have to be met before we can call a set of documents a proper dataset.

There is no working machine learning model without properly collected and prepared data. Unfortunately for the people in the field, a large part of data scientists’ work consists of ensuring that the modelled data is representative of the domain that the “production data” (the data we want to use our model on) will be coming from. If it is not, the model won’t work.

Deep learning models require a massive amount of data to train properly. A popular way to reduce the amount of data needed is transfer learning. Without it, you have to collect datasets containing tens of thousands or even millions of examples to make sure that the data collected is representative of all possible domains. Fortunately, a few years ago, advances were made, initially in computer vision applications, and it became possible to fine-tune “pre-trained models” for a custom application.

Let’s consider the significant consequences of that: training a good-quality model from scratch is only possible in a few places in the world, because it requires huge computing power and a lot of talented engineers. Few clients would be willing to pay that price and wait that long for models to start converging. Luckily, transfer learning (adapting a pretrained model to a similar domain) significantly reduces the necessary computing load and makes it possible to fine-tune huge models with hundreds of millions of parameters on a reasonably good gaming laptop.

As mentioned previously, this possibility first appeared in computer vision applications, and it took about 5 years to come up with models capable of transferring knowledge learned from large corpora to specific natural language processing applications. With the current state of technology, it is also possible to fine-tune large BERT models at home. Not only that, but there has been progress in easily available Python packages that enable this; the transformers, spaCy, and AllenNLP libraries are perhaps some of the main contributors here. With those tools, fine-tuning a model can take mere hours and already yield reasonably good results.
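
As an illustration, here is a minimal, hedged sketch of fine-tuning a pretrained BERT classifier with the Hugging Face transformers library. The dataset, sample sizes, and hyperparameters are assumptions chosen to keep the example small, not a recommended setup.

```python
from datasets import load_dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

# Any labelled text dataset works; IMDB is used here purely as an example.
dataset = load_dataset("imdb")
train_data = dataset["train"].shuffle(seed=42).select(range(2000))
eval_data = dataset["test"].shuffle(seed=42).select(range(500))

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, padding="max_length")

train_data = train_data.map(tokenize, batched=True)
eval_data = eval_data.map(tokenize, batched=True)

model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=2)

training_args = TrainingArguments(
    output_dir="bert-finetuned",     # where checkpoints are written
    num_train_epochs=2,              # a couple of passes is often enough
    per_device_train_batch_size=16,
)

Trainer(model=model, args=training_args,
        train_dataset=train_data, eval_dataset=eval_data).train()
```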

How Much Data Do We Need to Fine-Tune a Model? How to Create a Dataset?

That depends. I would say that to fine-tune BERT, a couple of thousand examples are usually enough. But how do you create a dataset that is suitable for downstream applications?

First, the most obvious and most expensive solution is to pay human annotators to label parts of texts as instructed. This is sometimes the only possible way, for example when it is difficult or impossible to create reasonable heuristics to generate synthetic data. But that approach has several possible drawbacks:

  1. It might be difficult to accurately train the annotators to do the task, as the data itself might be ambiguous.
  2. The task is usually very laborious, hence the need to limit daily annotation time to around 3 hours in order to avoid mistakes resulting from tiredness. An aspect that has to be monitored after the annotation is the inter-annotator agreement rate, which should be at the level of around 95%.

The inter-annotator agreement rate is measured by presenting annotators with texts that had previously been annotated by somebody else, with the annotations cleared. They are then asked to annotate to the best of their knowledge, and all annotations are compared.
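
A minimal sketch of how such an agreement check could be computed, using simple percentage agreement plus Cohen's kappa from scikit-learn; the label lists are made up for illustration.

```python
from sklearn.metrics import cohen_kappa_score

# Labels assigned independently by two annotators to the same 10 documents.
annotator_a = ["spam", "spam", "ham", "ham", "spam", "ham", "ham", "spam", "ham", "ham"]
annotator_b = ["spam", "spam", "ham", "spam", "spam", "ham", "ham", "spam", "ham", "ham"]

agreement = sum(a == b for a, b in zip(annotator_a, annotator_b)) / len(annotator_a)
print(f"Raw agreement: {agreement:.0%}")                              # 90%
print(f"Cohen's kappa: {cohen_kappa_score(annotator_a, annotator_b):.2f}")
```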

Generation of Training Data in Various Fields – Examples

1. Archeology

Let’s consider a case described in a paper published in the ACL Anthology. The authors were trying to extract information from archaeological documents. The approach they chose was fine-tuning a named entity recognition model with custom entities (here, archaeology-related entities), which is a rather standard approach in information extraction tasks. In order to achieve sensible results, they paid 5 archaeology students for 16 hours of work each, and were able to raise their F0.5 score from 0.5 to around 0.71, with performance peaking at around 7k annotated examples. The F-beta score is a way of combining precision and recall while saying that one is more or less important than the other (controlled by the beta parameter).
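
For reference, the F-beta score is defined as F_beta = (1 + beta^2) * precision * recall / (beta^2 * precision + recall); with beta = 0.5, precision is weighted more heavily than recall. A tiny sketch with scikit-learn and made-up predictions:

```python
from sklearn.metrics import fbeta_score

# Made-up binary predictions: 1 = entity token, 0 = other.
y_true = [1, 1, 1, 0, 0, 0, 1, 0, 1, 0]
y_pred = [1, 1, 0, 0, 0, 1, 1, 0, 1, 0]

# beta < 1 weights precision more than recall; beta > 1 does the opposite.
print(fbeta_score(y_true, y_pred, beta=0.5))  # 0.8 for this toy example
```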

2. Artwork

In another publication, the authors described a way to automatically generate data for the same task as above, but pertaining to a different domain. They point to the scarcity of training data, rather than ever trickier model architectures, as the main bottleneck for better model performance. They chose to use the Snorkel tool to create high-quality annotated data. You will find more details in the paper.
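
Snorkel’s core idea is weak supervision: instead of hand-labelling every example, you write small heuristic labelling functions and let a label model combine their noisy, sometimes conflicting votes. Below is a minimal sketch of what that could look like; the heuristics and data are invented for illustration and are not the ones used in the paper.

```python
import pandas as pd
from snorkel.labeling import PandasLFApplier, labeling_function
from snorkel.labeling.model import LabelModel

ABSTAIN, OTHER, ARTWORK = -1, 0, 1

@labeling_function()
def lf_contains_painting(x):
    # Crude keyword heuristic: mentions of "painting" suggest an artwork entity.
    return ARTWORK if "painting" in x.text.lower() else ABSTAIN

@labeling_function()
def lf_quoted_title(x):
    # Quoted phrases often denote titles of works.
    return ARTWORK if '"' in x.text else ABSTAIN

df = pd.DataFrame({"text": [
    'The painting "Starry Night" hangs in MoMA.',
    "The meeting was rescheduled to Monday.",
]})

# Apply the labelling functions, then let the label model resolve their votes.
L_train = PandasLFApplier(lfs=[lf_contains_painting, lf_quoted_title]).apply(df)
label_model = LabelModel(cardinality=2, verbose=False)
label_model.fit(L_train, n_epochs=100)
print(label_model.predict(L_train))
```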

Application of text classification powered by machine learning – Example

Another possible application of text classification is Grammarly.

It was widely advertised on social media as a platform that helps write inspiring essays by preferring more sophisticated vocabulary over simpler terms. What is very interesting is that the authors included cross-sentence context in their models.

The developers of Grammarly have released two papers describing their work, although neither accurately describes how they managed to achieve such excellent results, perhaps due to reasonable concerns about losing their competitive advantage.

Yet, they have revealed some aspects of their solution. They drew heavily on work in machine translation, in which a sequence of tokens in one language is mapped to a sequence of tokens in another language. Instead of translating, Grammarly‘s authors adapted this approach to map grammatically incorrect sentences to grammatically correct ones, based on the clever observation that the task can in fact be formulated very much like machine translation. They managed to heavily augment their training datasets by scraping data from publicly available websites and introducing controlled spelling or grammar mistakes. What they did not mention, however, was how they managed to incorporate cross-sentence context.
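
The data augmentation idea of introducing controlled mistakes into clean text is easy to illustrate. The sketch below is a generic, simplified version of that idea, not Grammarly’s actual procedure; the error types and rate are assumptions.

```python
import random

def corrupt_sentence(sentence: str, error_rate: float = 0.15, seed: int = 0) -> str:
    """Introduce simple, controlled errors into a clean sentence.

    Each sufficiently long word is, with probability `error_rate`, either
    given a swapped pair of adjacent characters, dropped, or duplicated.
    """
    rng = random.Random(seed)
    corrupted = []
    for word in sentence.split():
        if rng.random() < error_rate and len(word) > 3:
            choice = rng.choice(["swap", "drop", "duplicate"])
            if choice == "swap":
                i = rng.randrange(len(word) - 1)
                corrupted.append(word[:i] + word[i + 1] + word[i] + word[i + 2:])
            elif choice == "drop":
                continue  # simulate a missing word
            else:
                corrupted.extend([word, word])  # simulate a repeated word
        else:
            corrupted.append(word)
    return " ".join(corrupted)

clean = "The committee has approved the proposal without any changes."
print(corrupt_sentence(clean))  # noisy source sentence
print(clean)                    # clean target for a seq2seq correction model
```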

Summing up, their solution works like this: it reads a document sentence by sentence, corrects a sentence only if the model assigns a very high probability to the correction being right, and then iterates over the subsequent sentences.
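
A hedged sketch of that loop is shown below; `correct_sentence` is a hypothetical stand-in for a sequence-to-sequence correction model, and the confidence threshold is an assumed value, not one disclosed in the papers.

```python
CONFIDENCE_THRESHOLD = 0.95  # assumed; the real threshold is not published

def correct_document(sentences, correct_sentence):
    """Iterate sentence by sentence, accepting only high-confidence corrections."""
    corrected = []
    for sentence in sentences:
        suggestion, probability = correct_sentence(sentence)
        # Keep the original sentence unless the model is very confident.
        corrected.append(suggestion if probability >= CONFIDENCE_THRESHOLD else sentence)
    return corrected
```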

Pretty clever, isn’t it?

How do you Choose a Good Model for Downstream Tasks?

Consider also that although the BERT model is lauded everywhere, it might not be the best choice for all NLP applications. BERT was pretrained on two tasks: next sentence prediction and masked language modelling. 

  1. In the first task, the model was presented with two sentences, one following the other, and a label of 0 or 1 denoting whether those sentences were in fact consecutive or had just been randomly paired.
  2. The other task, masked language modelling, treats BERT as a denoising autoencoder: part of the words are masked with a predefined token, or substituted with other words from the context, and the model is asked to reconstruct the original input from the noisy version (a short sketch of this follows the list).
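
A minimal sketch of masked language modelling in action, using the Hugging Face transformers fill-mask pipeline with a standard pretrained BERT checkpoint; the example sentence is made up.

```python
from transformers import pipeline

# Load a pretrained BERT and ask it to fill in the masked token.
fill_mask = pipeline("fill-mask", model="bert-base-uncased")

for prediction in fill_mask("Machine learning helps to [MASK] text documents.")[:3]:
    print(prediction["token_str"], round(prediction["score"], 3))
```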

Some research indicates that for sequence classification tasks, for example, better results can be achieved with the Universal Sentence Encoder or LASER. While that might be true in domain-agnostic tasks, BERT comes with the advantage of being available in a wide range of pretrained configurations: versions pretrained on legal, medical, and other corpora are easily accessible, which is not true for many other models.

BERT is a model based on the Transformer architecture, which proved to be as effective as earlier recurrent sequence-to-sequence models and faster to train back in 2017. BERT has beaten many NLP benchmarks, but it is effective mainly for short texts: it has the inherent limitation of being able to process only up to 510 tokens of content. That limitation is a consequence of self-attention’s computational complexity scaling quadratically with sequence length. Multiple approaches have been proposed to lift that limitation, such as Longformer and Reformer, which reduce the complexity to linear or near-linear.

Currently, the most interesting line of research concerns knowledge-intensive tasks (RAG models, etc.). Knowledge-intensive tasks require external knowledge from participants. For example, a lawyer performs a knowledge-intensive operation when analysing legal documents, basing his or her judgement on information not directly available in the document. By contrast, sentiment classification would rarely be considered a knowledge-intensive task, as judging whether the sentiment of a tweet is positive usually requires no outside knowledge. Another related task is open-domain question answering, in which the model is given a set of documents that may, but do not necessarily, contain the answer to the question it is asked. It turns out that big models can preserve knowledge that was present in the corpus on which they were pretrained.

Summary 

Progress in machine learning is impressive: more and more tasks are being automated daily, which means that the most repetitive jobs are no longer needed. Some may fear that as a result they will lose their jobs, and may therefore try to resist automation. This line of thought seems logical but is flawed: workers trying to resist the change are doomed to fail. It is much better to embrace it and raise one’s qualifications. The choice is individual.

Want to ask about machine learning in text classification or how to use machine learning in marketing? Let’s talk.


Source of visual materials:
https://blog.google/technology/ai/lamda/
https://blog.google/products/search/introducing-mum/