What is the benefit of using Transformer in NLP?

The Transformer is a deep learning model used in NLP. It allows far greater parallelization during training and thus made it practical to train on much larger data sets. How does the Transformer work, and which problems of the encoder-decoder architecture does it solve?

Even after the attention mechanism was added to the encoder-decoder architecture, several problems persisted. Such recurrent encoder-decoder models were extremely time-consuming to train – on the order of ten days on a cluster of 8 GPUs – which is out of reach without a substantial training budget. They were also difficult to train in practice, because recurrent networks exhibit the so-called vanishing / exploding gradient problem and are hard to parallelize even when the computational resources are available: each time step depends on the previous one, and networks such as LSTMs carry a large number of hidden units to update at every step, which is why training takes so long. Why is the Transformer a better solution? It is built exclusively from attention blocks, gives very good results with far less engineering time spent on tuning, and, having no recurrence, does not suffer from the vanishing or exploding gradients that plague the recurrent encoder-decoder architecture.
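As a toy illustration of the parallelization point, here is a minimal NumPy sketch (with made-up dimensions, not taken from any paper) contrasting the step-by-step recurrence of an RNN with the single batched matrix product that self-attention performs over all positions at once.

# Minimal sketch: why an RNN is hard to parallelize across time steps,
# while attention is not. All weights and sizes below are hypothetical.
import numpy as np

seq_len, d = 5, 4                      # invented sequence length and hidden size
x = np.random.randn(seq_len, d)        # one input sequence, one vector per token

# RNN-style recurrence: step t cannot start before step t-1 has finished,
# so the loop below is inherently sequential.
W_h, W_x = np.random.randn(d, d), np.random.randn(d, d)
h = np.zeros(d)
for t in range(seq_len):
    h = np.tanh(h @ W_h + x[t] @ W_x)  # each step depends on the previous hidden state

# Attention-style mixing: every position attends to every other position in one
# batched matrix product, so all positions can be computed in parallel.
scores = x @ x.T / np.sqrt(d)                                     # pairwise token similarities
weights = np.exp(scores) / np.exp(scores).sum(-1, keepdims=True)  # softmax over each row
context = weights @ x                                             # all output vectors at once
print(context.shape)                                              # (5, 4)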


How does Transformer work?

The Transformer in a sense mimics the encoder-decoder idea: its network has two subparts, an encoder stack and a decoder stack.

Both of them are repeated several times (6 times in the 2017 paper that introduced the Transformer, and 12 times in the BERT-base architecture). Both the encoder and the decoder dispense with recurrent neural networks entirely and rely only on the attention mechanism and its variations: multi-head attention, and masked multi-head attention in the decoder. In addition, the Transformer uses layer normalisation, which reduces the number of optimization steps needed: activations are rescaled to a common scale (mean of 0 and variance of 1), which reduces internal covariate shift, i.e. the changes in the distribution of each layer's inputs during training. This may sound similar to a batch normalisation layer, but it is not the same concept: layer normalisation normalizes across the features of each individual example rather than across the batch, so it does not depend on reasonably large batch sizes the way batch normalisation does.
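To make these ingredients concrete, here is a small NumPy sketch of one encoder block. The weights are random placeholders and the sizes are invented, so it only illustrates the structure (multi-head self-attention, residual connections, and layer normalisation over each token's features), not any real implementation.

# A minimal sketch of one Transformer encoder block; everything is illustrative.
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    return np.exp(z) / np.exp(z).sum(axis=-1, keepdims=True)

def layer_norm(x, eps=1e-5):
    # Layer normalisation: statistics are computed per token over its features,
    # so it behaves the same regardless of batch size (unlike batch norm,
    # which averages over the batch dimension).
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def multi_head_attention(x, num_heads=2):
    # Self-attention split into several heads; the projections are random placeholders.
    seq_len, d_model = x.shape
    d_head = d_model // num_heads
    outputs = []
    for _ in range(num_heads):
        W_q, W_k, W_v = (np.random.randn(d_model, d_head) for _ in range(3))
        q, k, v = x @ W_q, x @ W_k, x @ W_v
        attn = softmax(q @ k.T / np.sqrt(d_head))   # scaled dot-product attention
        outputs.append(attn @ v)
    return np.concatenate(outputs, axis=-1)         # concatenate the heads

def encoder_block(x):
    # Attention sub-layer followed by a residual connection and layer normalisation.
    x = layer_norm(x + multi_head_attention(x))
    # Position-wise feed-forward sub-layer (a single random projection with ReLU here).
    W_ff = np.random.randn(x.shape[-1], x.shape[-1])
    return layer_norm(x + np.maximum(0, x @ W_ff))

tokens = np.random.randn(7, 8)          # 7 tokens, model dimension 8
print(encoder_block(tokens).shape)      # (7, 8)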


How do Transformer networks work in NLP?

The attention mechanism has another notable property: it is insensitive to word order, so the words of a sentence could be shuffled without changing what the network computes for each of them. That is why the Transformer comes with yet another idea, a positional encoding layer that injects information about each word's position into its embedding.
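As an illustration, here is a short NumPy sketch of the fixed sinusoidal positional encoding proposed in the 2017 paper (BERT instead learns its positional embeddings); the sequence length and model dimension below are arbitrary.

# Sinusoidal positional encoding: each position gets a fixed vector of sines and
# cosines at different frequencies, added to the word embeddings so order matters.
import numpy as np

def positional_encoding(seq_len, d_model):
    positions = np.arange(seq_len)[:, None]                  # (seq_len, 1)
    dims = np.arange(d_model)[None, :]                       # (1, d_model)
    angle_rates = 1.0 / np.power(10000, (2 * (dims // 2)) / d_model)
    angles = positions * angle_rates
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles[:, 0::2])                    # even indices: sine
    pe[:, 1::2] = np.cos(angles[:, 1::2])                    # odd indices: cosine
    return pe

embeddings = np.random.randn(10, 16)                         # hypothetical word embeddings
inputs = embeddings + positional_encoding(10, 16)            # order information injected
print(inputs.shape)                                          # (10, 16)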

The Transformer did not dramatically surpass the state-of-the-art models of its time, but it matched or exceeded their results in a fraction of the training time, which is a large part of why it became so popular.