Transformer Block#

Attention is all you need#

The transformer model comes from the paper *Attention Is All You Need*. The paper shows how powerful pure attention mechanisms can be: the whole model is built around an attention mechanism called self-attention, which we’ll discuss later.

The significance of self-attention is that, using only attention mechanisms and no recurrence, the model achieves state-of-the-art performance on many datasets, in a field previously dominated by RNNs.

How does the transformer work?#

The transformer architecture is based on a seq2seq model. Traditionally, a seq2seq model is basically an encoder and a decoder, like an auto-encoder, but both the encoder and the decoder are RNNs. The encoder first processes the input, then feeds its RNN state or output to the decoder, which decodes the full output sentence. The idea is that the encoder should be able to encode the input into some kind of representation that captures the meaning of the sentence, and the decoder should be able to understand that representation.
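
To make that concrete, here is a minimal sketch of the RNN-based handoff, assuming PyTorch and made-up dimensions: the encoder compresses the whole input into its final hidden state, and the decoder starts from that state.

```python
import torch
import torch.nn as nn

# Minimal RNN seq2seq sketch (hypothetical sizes). The encoder squeezes the whole
# input sentence into its final hidden state; the decoder is initialized from it.
vocab_size, emb_dim, hidden_dim = 1000, 32, 64

embed   = nn.Embedding(vocab_size, emb_dim)
encoder = nn.GRU(emb_dim, hidden_dim, batch_first=True)
decoder = nn.GRU(emb_dim, hidden_dim, batch_first=True)

src = torch.randint(0, vocab_size, (1, 7))    # one source sentence, 7 tokens
_, enc_state = encoder(embed(src))            # enc_state: (1, 1, hidden_dim)

tgt = torch.randint(0, vocab_size, (1, 5))    # decoder input tokens
dec_out, _ = decoder(embed(tgt), enc_state)   # decoder starts from the encoder's state
print(dec_out.shape)                          # torch.Size([1, 5, 64])
```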

In the case of transformers there is no RNN, so instead of an RNN state, the attention output produced by the encoder is sent to the decoder. The decoder uses that global information to produce the output.
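
A minimal sketch of that handoff, assuming PyTorch and made-up shapes: scaled dot-product attention lets every decoder position look at the whole encoded input, instead of receiving a single RNN state.

```python
import torch

def attention(q, k, v):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d)) V."""
    d = q.shape[-1]
    scores = q @ k.transpose(-2, -1) / d ** 0.5   # (..., q_len, k_len)
    return torch.softmax(scores, dim=-1) @ v      # (..., q_len, d_v)

enc_out = torch.randn(1, 7, 64)   # encoder output: 7 input tokens, dim 64 (made up)
dec_q   = torch.randn(1, 5, 64)   # decoder-side queries for 5 output positions

# Cross-attention: the decoder attends over the encoder's output.
ctx = attention(dec_q, enc_out, enc_out)
print(ctx.shape)                  # torch.Size([1, 5, 64])
```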

Transformer encoder#

The encoding component is a stack of smaller encoders. An encoder does the following (a sketch of one full encoder layer follows the list):

  1. Calculate self-attention scores for the input \( I \).

  2. Weigh the input by self-attention scores \( S(I) \).

  3. Pass it through an add-and-normalize layer \( O = I + S(I) \).

  4. Feed the processed data through a linear layer \( F = f(O) \).

  5. Perform activation on the linear layer’s output \( F' = \sigma(F) \).

  6. Multiply the activated output elementwise with the linear layer’s output \( F'' = F' \odot F \).

  7. Pass it through an add-and-normalize layer \( F'' + O \).

An add-and-normalize layer performs a residual add, adding the layer’s input to the processed output, and then applies layer normalization to the sum.
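
Putting the steps together, here is a minimal sketch of one encoder layer, assuming PyTorch. `nn.MultiheadAttention` stands in for steps 1 and 2, and the feed-forward part mirrors the gated form written in the steps above (the paper’s own feed-forward network is two linear layers with a ReLU in between). All dimensions are made up.

```python
import torch
import torch.nn as nn

class EncoderLayer(nn.Module):
    """One encoder layer following the steps listed above (sizes are made up)."""

    def __init__(self, dim=64, heads=4):
        super().__init__()
        self.attn  = nn.MultiheadAttention(dim, heads, batch_first=True)  # steps 1-2
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)
        self.lin   = nn.Linear(dim, dim)                                  # step 4

    def forward(self, x):
        s, _ = self.attn(x, x, x)        # steps 1-2: self-attention S(I)
        o = self.norm1(x + s)            # step 3: add and normalize
        f = self.lin(o)                  # step 4: linear layer, F = f(O)
        f2 = torch.sigmoid(f) * f        # steps 5-6: F' = sigma(F), F'' = F' * F
        return self.norm2(f2 + o)        # step 7: add and normalize

x = torch.randn(2, 10, 64)               # a batch of 2 sentences, 10 tokens each
print(EncoderLayer()(x).shape)           # torch.Size([2, 10, 64])
```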

Transformer decoder#

The decoding component looks a lot like the encoding component: it is also a stack of decoders. Decoders are basically encoders, but they additionally take the attention produced by the encoder, and perform step \( 3 \) twice: the first add uses the decoder’s own self-attention output, and the second uses the attention computed over the encoder’s output.
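
Under the same assumptions, a minimal sketch of one decoder layer: masked self-attention with its add-and-normalize, then attention over the encoder’s output with a second add-and-normalize, then a (simplified) feed-forward layer.

```python
import torch
import torch.nn as nn

class DecoderLayer(nn.Module):
    """One decoder layer: masked self-attention, then attention over the encoder
    output, each followed by an add-and-normalize (sizes are made up)."""

    def __init__(self, dim=64, heads=4):
        super().__init__()
        self.self_attn  = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)
        self.norm3 = nn.LayerNorm(dim)
        self.ff    = nn.Linear(dim, dim)

    def forward(self, x, enc_out):
        # Causal mask: position t is not allowed to attend to positions after t.
        t = x.shape[1]
        mask = torch.triu(torch.ones(t, t), diagonal=1).bool()

        s, _ = self.self_attn(x, x, x, attn_mask=mask)
        x = self.norm1(x + s)                        # step 3 with decoder self-attention
        c, _ = self.cross_attn(x, enc_out, enc_out)
        x = self.norm2(x + c)                        # step 3 again, with encoder attention
        return self.norm3(x + self.ff(x))            # simplified feed-forward + add & norm

enc_out = torch.randn(1, 7, 64)                      # output of the encoder stack
tgt     = torch.randn(1, 5, 64)                      # embedded decoder input so far
print(DecoderLayer()(tgt, enc_out).shape)            # torch.Size([1, 5, 64])
```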

For transformers, the decoder is an auto-regressive model. In inference mode, what it does is no different from any other decoder: it takes in the sequence it has generated so far (starting from a special start token when nothing has been generated yet) and predicts the next token. So what is the encoder doing? It turns out it is used for computing and storing attention information about the input. That stored information is then passed to the decoders (at every layer) so that the decoders know the meaning of the input sentence.
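
In inference mode that loop looks roughly like the sketch below. `model`, `BOS`, `EOS`, and `MAX_LEN` are hypothetical names: `model(src, generated)` is assumed to run the encoder over the source tokens and the decoder over the tokens generated so far, returning next-token logits.

```python
import torch

# Greedy auto-regressive decoding sketch. BOS/EOS token ids are made up, and
# `model(src, generated)` is assumed to return logits of shape (1, len(generated), vocab).
BOS, EOS, MAX_LEN = 1, 2, 50

def greedy_decode(model, src):
    generated = torch.tensor([[BOS]])                # start from a begin-of-sentence token
    for _ in range(MAX_LEN):
        logits = model(src, generated)               # encoder sees src, decoder sees what exists so far
        next_token = logits[0, -1].argmax()          # pick the most likely next token
        generated = torch.cat([generated, next_token.view(1, 1)], dim=1)
        if next_token.item() == EOS:                 # stop once the model says the sentence is done
            break
    return generated[0]
```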

Positional encoding#

Words in a sentence have different meanings if they are ordered differently. The sentence Alice ate Bob is very different in meaning from Bob ate Alice. Well, at least for Alice and Bob. For RNNs, that isn’t an issue, because an RNN runs over a sentence sequentially, so it will see either Alice or Bob first and knows who appears to be eaten. However, transformers have no way of knowing who comes first, because the self-attention mechanism is symmetric with respect to position.

That is the reason we need to give the model information about position. In Attention Is All You Need, a positional encoding is added to the input. A positional encoding is basically an embedding with a different value for every index, so that the model knows what a word’s position is when it processes it.
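
In the paper the encoding is not learned; it is built from sine and cosine waves of different frequencies. A minimal sketch, assuming PyTorch and an even embedding size:

```python
import torch

def positional_encoding(seq_len, dim):
    """Sinusoidal positional encoding from the paper:
    PE[pos, 2i]   = sin(pos / 10000^(2i/dim))
    PE[pos, 2i+1] = cos(pos / 10000^(2i/dim))
    """
    pos = torch.arange(seq_len).unsqueeze(1).float()   # (seq_len, 1)
    i   = torch.arange(0, dim, 2).float()              # even dimension indices
    angle = pos / 10000 ** (i / dim)                   # (seq_len, dim // 2)
    pe = torch.zeros(seq_len, dim)
    pe[:, 0::2] = torch.sin(angle)
    pe[:, 1::2] = torch.cos(angle)
    return pe

x = torch.randn(10, 64)                 # 10 word embeddings of size 64 (made up)
x = x + positional_encoding(10, 64)     # the encoding is simply added to the input
```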

A very interesting fact is that changing the order of the tokens does not actually change the per-token outputs of the model (unlike RNNs), as long as the right positional encoding stays associated with the right token. That means that after applying (usually by adding) the positional encoding to the input word embedding vectors, you can shuffle the vectors (along the time axis) all you want: since the encoder’s self-attention has no causal mask, each token’s output stays the same, and only the order of the outputs is shuffled accordingly. Very cool indeed.
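
A quick way to convince yourself, using plain unmasked self-attention on made-up tensors: shuffling the input rows only shuffles the output rows in exactly the same way.

```python
import torch

def self_attention(x):
    """Plain self-attention with no mask: softmax(X X^T / sqrt(d)) X."""
    d = x.shape[-1]
    scores = x @ x.transpose(-2, -1) / d ** 0.5
    return torch.softmax(scores, dim=-1) @ x

x = torch.randn(10, 64)        # 10 token vectors with positional encoding already added
perm = torch.randperm(10)      # some shuffle along the time axis

out_original = self_attention(x)
out_shuffled = self_attention(x[perm])

# Each token's output vector is identical; only the order changed.
print(torch.allclose(out_original[perm], out_shuffled, atol=1e-6))   # True
```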