The attention mechanism and deep learning – a gem among state-of-the-art NLP techniques

Ladies and gentlemen, introducing “attention”! This is a crucial mechanism in deep learning, one that has caused a revolution not only in natural language processing (NLP) but also in fields such as healthcare analytics. The word is ubiquitous in explanations of current state-of-the-art NLP, and it is often mentioned as one of the building blocks of the novel BERT architecture. But what is it, precisely?


Where did attention come from in NLP?

The attention mechanism solves a problem that the previous state of the art, the encoder-decoder architecture, could not. The encoder-decoder approach was published in 2014. The model has many variations, but at a high level its architecture is divided into two parts: one sub-model (be it a recurrent, convolutional or recursive neural network) that encodes a sentence in one language into a vector (the approach was originally devised for machine translation), and another sub-model that “decodes” that same information into another language.

Why was it needed? Up to that time, deep neural networks couldn’t be used to map sequences to sequences; they could only be applied to problems where the input and output could be encoded as vectors of fixed dimensionality. That is impossible for sentences, as the same meaning is usually conveyed with different numbers of tokens (words) in different languages.

Although the encoder-decoder approach beat the previous state of the art, it is not a perfect solution. The encoder has to squeeze the whole sentence into a single vector, so when sentences are long or vary a lot in length, the architecture can “overlook” a relevant word.
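
To make that bottleneck a bit more tangible, here is a minimal sketch of a toy recurrent encoder in plain Python. Everything in it is an assumption made for illustration: the `encode` function, the random weight matrices `W_h` and `W_x`, and the random "sentences" are hypothetical and untrained. The only point is that a 3-token input and a 30-token input both end up in a vector of the same size.

```python
import math
import random

def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(a * b for a, b in zip(row, vector)) for row in matrix]

def encode(token_vectors, W_h, W_x):
    """Toy recurrent encoder: fold the whole sequence into one hidden state."""
    hidden = [0.0] * len(W_h)
    for x in token_vectors:
        pre = [a + b for a, b in zip(matvec(W_h, hidden), matvec(W_x, x))]
        hidden = [math.tanh(p) for p in pre]
    return hidden  # same size no matter how many tokens went in

# Hypothetical, untrained weights and inputs, seeded for repeatability.
rng = random.Random(0)
hidden_dim, embed_dim = 4, 3
W_h = [[rng.uniform(-0.5, 0.5) for _ in range(hidden_dim)] for _ in range(hidden_dim)]
W_x = [[rng.uniform(-0.5, 0.5) for _ in range(embed_dim)] for _ in range(hidden_dim)]

def random_sentence(length):
    """Stand-in for a sentence: one random embedding per token."""
    return [[rng.uniform(-1.0, 1.0) for _ in range(embed_dim)] for _ in range(length)]

short_code = encode(random_sentence(3), W_h, W_x)
long_code = encode(random_sentence(30), W_h, W_x)
print(len(short_code), len(long_code))  # both equal hidden_dim
```

The longer the sentence, the more information has to fit into those same few numbers, which is where words start to get lost.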


The attention mechanism in deep learning – a solution for complex sentences

One solution is the “attention” mechanism. At each decoding step it gathers relevant information from different parts of the input sentence and passes it to the “decoder” part of the architecture alongside the encoded vector, so that information is not lost.
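
The gathering step can be sketched in a few lines of plain Python. This is a simplified illustration, not a reference implementation: it assumes dot-product scoring, and the function names (`softmax`, `attend`) and the toy vectors are made up for the example. The decoder state is compared with every encoder state, the scores are turned into weights that sum to 1, and the weighted average of the encoder states becomes the "context" handed to the decoder.

```python
import math

def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def attend(decoder_state, encoder_states):
    """Return (weights, context) for a single decoding step."""
    scores = [dot(decoder_state, h) for h in encoder_states]
    weights = softmax(scores)
    dim = len(encoder_states[0])
    # Context = weighted average of the encoder states.
    context = [sum(w * h[i] for w, h in zip(weights, encoder_states))
               for i in range(dim)]
    return weights, context

# Toy, hand-made vectors (hypothetical values, not from a trained model).
encoder_states = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
decoder_state = [2.0, 0.0]
weights, context = attend(decoder_state, encoder_states)
print(weights)
print(context)
```

In this toy input the first and third encoder states line up with the decoder state, so they receive most of the weight, while the second one is largely ignored.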


Let’s pick an example; it always helps. Take the previous sentence, “Let’s pick an example; it always helps”: for humans, it is obvious that “it” refers to “an example”, but a neural network needs a mechanism such as attention to pick that up.


Types of attention mechanisms in neural networks

Attention serves the goal of preserving information in long sequences. There are many variants, with the three most popular being:

  1. dot product – the dot product of the encoder and decoder states, which gives a measure of their similarity
  2. additive attention – uses a neural network with one hidden layer to predict the similarity
  3. multiplicative attention – puts a matrix of “weights” that need to be learnt between the encoder and decoder states before taking the dot product
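
The three scoring functions in the list above can be sketched side by side. Treat this as an illustration under assumptions: `s` stands for a decoder state and `h` for an encoder state, the parameter names (`W`, `W_s`, `W_h`, `v`) are chosen for the example, and the values are hand-picked rather than learnt.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [dot(row, vector) for row in matrix]

def dot_score(s, h):
    """Dot product: s . h, with no learned parameters."""
    return dot(s, h)

def multiplicative_score(s, h, W):
    """Multiplicative: s . (W h), where W is a learned matrix."""
    return dot(s, matvec(W, h))

def additive_score(s, h, W_s, W_h, v):
    """Additive: v . tanh(W_s s + W_h h), a network with one hidden layer."""
    hidden = [math.tanh(a + b)
              for a, b in zip(matvec(W_s, s), matvec(W_h, h))]
    return dot(v, hidden)

# Hypothetical values; in a real model the matrices and v are learnt.
s = [1.0, 2.0]
h = [0.5, -1.0]
identity = [[1.0, 0.0], [0.0, 1.0]]

plain = dot_score(s, h)
weighted = multiplicative_score(s, h, identity)
additive = additive_score(s, h, identity, identity, [1.0, 1.0])
print(plain, weighted, additive)
```

With the identity matrix as `W`, the multiplicative score collapses to the plain dot product, which shows how the two are related: the learnt matrix is what lets the model decide which directions of the state vectors matter.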

That is it! There are also extensions such as multi-headed attention, which improves on the single-headed version by running several attention “heads” in parallel, so it can capture broader context (preserve more information about the sentence).
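
Here is a rough sketch of the multi-headed idea, again in plain Python and again under assumptions: each head gets its own hypothetical projection matrices (`W_q`, `W_k`, `W_v`, filled with random numbers here instead of learnt values), runs scaled dot-product attention on its own view of the inputs, and the per-head results are concatenated. The final output projection used in full implementations is left out to keep the example short.

```python
import math
import random

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def matvec(matrix, vector):
    return [dot(row, vector) for row in matrix]

def softmax(scores):
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def single_head(query, keys, values):
    """Scaled dot-product attention for one query vector."""
    scale = math.sqrt(len(query))
    weights = softmax([dot(query, k) / scale for k in keys])
    dim = len(values[0])
    return [sum(w * v[i] for w, v in zip(weights, values)) for i in range(dim)]

def multi_head(query, states, heads):
    """Run every head on its own projection of the inputs, then concatenate.

    `heads` is a list of (W_q, W_k, W_v) matrix triples, one per head.
    """
    output = []
    for W_q, W_k, W_v in heads:
        q = matvec(W_q, query)
        keys = [matvec(W_k, h) for h in states]
        values = [matvec(W_v, h) for h in states]
        output.extend(single_head(q, keys, values))
    return output

def random_matrix(rows, cols, rng):
    return [[rng.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]

# Hypothetical sizes and untrained random weights, seeded for repeatability.
rng = random.Random(0)
model_dim, head_dim, num_heads = 4, 2, 2
heads = [tuple(random_matrix(head_dim, model_dim, rng) for _ in range(3))
         for _ in range(num_heads)]
states = [[rng.uniform(-1.0, 1.0) for _ in range(model_dim)] for _ in range(3)]
query = states[0]

result = multi_head(query, states, heads)
print(len(result))  # num_heads * head_dim values
```

Because each head projects the inputs differently, each one can end up weighting the tokens differently, which is where the "broader context" comes from.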


How knowledge about the attention mechanism can affect your business

Say you’re running a business that is present on social media. With sentiment classification, it is very quick to check whether customers are saying positive or negative things about you, which in turn might help you decide what to invest in.
Or maybe you are the type of company that answers people’s questions, but those questions tend to be very repetitive. Why not implement your own chatbot that will greatly reduce the demand on your staff and be available to your customers 24 hours a day?
Attention is at the core of today’s state-of-the-art models for both of these tasks.