The following are my learnings from the paper titled "Attention Is All You Need":

  • Using RNNs for language modeling has been particularly painful: they take a long time to train because tokens are processed sequentially, so the representations of all positions cannot be computed at once
  • In the Transformer architecture, the number of operations required to relate signals from two arbitrary input or output positions is constant, independent of the distance between them
  • Self-attention is an attention mechanism that relates different positions of a single sequence in order to compute a representation of that sequence
  • The Transformer is the first transduction model relying entirely on self-attention to compute representations of its input and output, without using sequence-aligned RNNs or convolutions
  • Learnt about the relationship between Induction, Deduction and Transduction
    • Induction derives the function from the given data, i.e. creates an approximating function
    • Deduction derives the values of a given function for points of interest
    • Transduction derives the values of an unknown function for points of interest directly from the data


  • An example of a transduction algorithm is k-nearest neighbors
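To make the transduction idea concrete for myself, here is a minimal k-nearest neighbors sketch in NumPy. The function name `knn_predict` and the toy data are my own illustrative choices. The point is that no approximating function is ever fitted: each query point is labeled directly from the stored training data.

```python
import numpy as np

def knn_predict(train_x, train_y, query_x, k=3):
    """Label each query point by majority vote among its k nearest training points."""
    train_x = np.asarray(train_x, dtype=float)
    train_y = np.asarray(train_y)
    query_x = np.asarray(query_x, dtype=float)
    # Pairwise Euclidean distances, shape (num_queries, num_train).
    dists = np.linalg.norm(query_x[:, None, :] - train_x[None, :, :], axis=-1)
    # Indices of the k closest training points for every query.
    nearest = np.argsort(dists, axis=1)[:, :k]
    predictions = []
    for row in nearest:
        labels, counts = np.unique(train_y[row], return_counts=True)
        predictions.append(labels[np.argmax(counts)])
    return np.array(predictions)

# Toy data (hypothetical): two well-separated clusters.
train_x = [[0.0, 0.0], [0.1, 0.2], [0.2, 0.1], [5.0, 5.0], [5.1, 4.9], [4.8, 5.2]]
train_y = [0, 0, 0, 1, 1, 1]
knn_preds = knn_predict(train_x, train_y, [[0.1, 0.1], [5.0, 5.1]], k=3)
```

Here the first query sits inside the first cluster and the second inside the other, so each takes the label of its neighbors.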
  • A transducer in the context of NLP is defined as a model that outputs one time step for each input time step provided
  • Many natural language processing (NLP) tasks can be viewed as transduction problems, that is, learning to convert one string into another. Machine translation is a prototypical example of transduction, and recent results indicate that deep RNNs can encode long source strings and produce coherent translations
  • Model Details
    • Encoder and Decoder
    • Input is the sum of the word embedding and a positional encoding
    • Each encoder layer comprises a multi-head self-attention sub-layer and a feed-forward network, each wrapped in a residual connection followed by layer normalization
    • Each decoder layer comprises a masked multi-head self-attention sub-layer, an encoder-decoder attention sub-layer (keys and values from the encoder, queries from the decoder) and a feed-forward network
    • The encoder is a stack of 6 identical layers
    • The decoder is a stack of 6 identical layers
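A rough sketch of how one sub-layer is wrapped, following the paper's LayerNorm(x + Sublayer(x)) and FFN(x) = max(0, xW1 + b1)W2 + b2. The function names, the toy sizes and the random weights are my own assumptions. The learned gain and bias of layer normalization are left out to keep it short.

```python
import numpy as np

def layer_norm(x, eps=1e-6):
    """Normalize each position's feature vector to zero mean and unit variance."""
    mean = x.mean(axis=-1, keepdims=True)
    std = x.std(axis=-1, keepdims=True)
    return (x - mean) / (std + eps)  # learned gain/bias omitted in this sketch

def feed_forward(x, w1, b1, w2, b2):
    """Position-wise feed-forward network: ReLU between two linear maps."""
    return np.maximum(0.0, x @ w1 + b1) @ w2 + b2

def residual_sublayer(x, sublayer):
    """Apply a sub-layer, add the residual connection, then layer-normalize."""
    return layer_norm(x + sublayer(x))

# Toy sizes (assumed): the paper's base model uses d_model=512 and d_ff=2048.
rng = np.random.default_rng(0)
seq_len, d_model, d_ff = 5, 8, 32
x = rng.normal(size=(seq_len, d_model))
w1, b1 = rng.normal(size=(d_model, d_ff)), np.zeros(d_ff)
w2, b2 = rng.normal(size=(d_ff, d_model)), np.zeros(d_model)
sublayer_out = residual_sublayer(x, lambda h: feed_forward(h, w1, b1, w2, b2))
```

The output keeps the shape of the input, which is what lets 6 such layers be stacked on top of each other.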
  • An attention function can be described as mapping a query and a set of key-value pairs to an output, where the query, keys, values and output are all vectors
  • Scaled dot-product attention

\begin{align} \text{Attention}(Q,K,V) & = \text{softmax} \left( {QK^T \over \sqrt{d_k}} \right) V \end{align}
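The formula above translates almost line for line into NumPy. This is a minimal sketch with no masking and no batching. The function names and the toy shapes are my own choices.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    shifted = x - x.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(q, k, v):
    """q: (n_q, d_k), k: (n_k, d_k), v: (n_k, d_v) -> output of shape (n_q, d_v)."""
    d_k = q.shape[-1]
    # Similarity of every query with every key, scaled by sqrt(d_k).
    scores = q @ np.swapaxes(k, -1, -2) / np.sqrt(d_k)
    # Each row of weights is a probability distribution over the keys.
    weights = softmax(scores, axis=-1)
    # Each output row is a weighted average of the value vectors.
    return weights @ v, weights

rng = np.random.default_rng(0)
q = rng.normal(size=(3, 4))  # 3 queries, d_k = 4
k = rng.normal(size=(5, 4))  # 5 keys
v = rng.normal(size=(5, 6))  # 5 values, d_v = 6
attn_out, attn_weights = scaled_dot_product_attention(q, k, v)
```

Each row of `attn_weights` sums to 1, so every output vector is a convex combination of the value vectors.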

  • MultiHead attention

\begin{align} \text{MultiHeadAttention}(Q,K,V) & = \text{Concat} (\text{head}_1, \text{head}_2, \dots, \text{head}_h) W^O \end{align}

where \(\text{head}_i = \text{Attention}(QW_i^Q, KW_i^K, VW_i^V)\) is the output of the \(i\)-th attention head, each head having its own learned projection matrices
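A small sketch of multi-head attention, assuming (as in the paper) that each head first projects the queries, keys and values with its own matrices and then runs scaled dot-product attention. The weights are random stand-ins for learned parameters, and the sizes are toy values I picked; the paper's base model uses 8 heads with d_model = 512.

```python
import numpy as np

def softmax(x, axis=-1):
    shifted = x - x.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)

def attention(q, k, v):
    """Scaled dot-product attention for a single head."""
    scores = q @ k.T / np.sqrt(q.shape[-1])
    return softmax(scores, axis=-1) @ v

def multi_head_attention(q, k, v, w_q, w_k, w_v, w_o):
    """w_q, w_k: (h, d_model, d_k); w_v: (h, d_model, d_v); w_o: (h * d_v, d_model)."""
    num_heads = w_q.shape[0]
    # Each head attends in its own projected subspace.
    heads = [attention(q @ w_q[i], k @ w_k[i], v @ w_v[i]) for i in range(num_heads)]
    # Concatenate the heads along the feature axis and project back to d_model.
    return np.concatenate(heads, axis=-1) @ w_o

rng = np.random.default_rng(0)
seq_len, d_model, num_heads = 6, 16, 4
d_k = d_v = d_model // num_heads
x = rng.normal(size=(seq_len, d_model))
w_q = rng.normal(size=(num_heads, d_model, d_k))
w_k = rng.normal(size=(num_heads, d_model, d_k))
w_v = rng.normal(size=(num_heads, d_model, d_v))
w_o = rng.normal(size=(num_heads * d_v, d_model))
# Self-attention: queries, keys and values all come from the same sequence.
mha_out = multi_head_attention(x, x, x, w_q, w_k, w_v, w_o)
```

Passing the same `x` three times gives self-attention. For encoder-decoder attention, `q` would come from the decoder and `k`, `v` from the encoder output.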

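  • Encoder contains self-attention layers, where every position can attend to all positions of the previous layer
  • Decoder contains self-attention layers, masked so that each position attends only to positions up to and including itself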
  • Positional encoding is done via sine and cosine functions
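For reference, the paper defines the encoding as PE(pos, 2i) = sin(pos / 10000^(2i/d_model)) and PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model)). A minimal NumPy sketch, with a function name of my own choosing and assuming `d_model` is even:

```python
import numpy as np

def positional_encoding(max_len, d_model):
    """Return a (max_len, d_model) matrix of sinusoidal position encodings."""
    positions = np.arange(max_len)[:, None]        # shape (max_len, 1)
    even_dims = np.arange(0, d_model, 2)[None, :]  # the values 2i, shape (1, d_model / 2)
    angles = positions / np.power(10000.0, even_dims / d_model)
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angles)  # even feature indices use sine
    pe[:, 1::2] = np.cos(angles)  # odd feature indices use cosine
    return pe

pe = positional_encoding(max_len=50, d_model=512)
```

This matrix is added to the word embeddings, so it has the same width as the embedding vectors.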
  • Why do the authors use self-attention?
    • Faster to train than RNN
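    • Total computational complexity per layer is lower whenever the sequence length n is smaller than the representation dimension d, which is the usual case for sentence-level translation
    • Path length between any two input and output positions is shorter compared to an RNN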
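A quick back-of-the-envelope check of the complexity argument, using the per-layer orders from the paper's comparison table: O(n²·d) for self-attention and O(n·d²) for a recurrent layer. Constant factors are ignored, and the function name and the sentence length are my own illustrative picks.

```python
def per_layer_cost(n, d):
    """Order-of-magnitude operation counts per layer (constant factors ignored)."""
    return {
        "self_attention": n * n * d,  # every position attends to every position
        "recurrent": n * d * d,       # one d x d matrix product per time step
    }

# A 50-token sentence with the base model's dimension of 512.
costs = per_layer_cost(n=50, d=512)
ratio = costs["recurrent"] / costs["self_attention"]
```

With n = 50 and d = 512 the recurrent layer needs about 10 times as many operations (the ratio is simply d / n). On top of that, the recurrent layer needs n sequential steps while self-attention needs a constant number, which is where the training speedup comes from.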
  • Training
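    • Performed on the WMT 2014 English-German dataset, which contains about 4.5 million sentence pairs (the English-French dataset is larger, at 36 million sentences)
    • Each training batch contains roughly 25,000 source tokens and 25,000 target tokens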
    • Base model training time is 12 hours
    • Big Model training time is 3.5 days
    • Trained on one machine with 8 NVIDIA P100 GPUs
    • Adam optimizer used
    • Regularization: residual dropout on the output of each sub-layer, dropout on the sums of the embeddings and positional encodings in both the encoder and decoder stacks, and label smoothing
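One detail worth writing down next to the Adam bullet: the paper pairs Adam with a learning rate that rises linearly during warmup and then decays with the inverse square root of the step number, lrate = d_model^(-0.5) · min(step^(-0.5), step · warmup_steps^(-1.5)), with warmup_steps = 4000. A small sketch of that schedule follows; the function name is my own.

```python
def transformer_lr(step, d_model=512, warmup_steps=4000):
    """Learning rate at a given training step (steps are counted from 1)."""
    if step < 1:
        raise ValueError("step must be >= 1")
    # Linear warmup for the first warmup_steps, inverse-sqrt decay afterwards.
    return d_model ** -0.5 * min(step ** -0.5, step * warmup_steps ** -1.5)

lr_early = transformer_lr(100)
lr_peak = transformer_lr(4000)
lr_late = transformer_lr(100000)
```

The rate peaks at the end of warmup, at roughly 7e-4 for the base model, and is lower both before and after that point.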
  • Results
    • English to German - BLEU score of 28.4
    • English to French - BLEU score of 41

Finally, after many weeks, I sat down and read through the entire paper. Of course, this would not have been possible without the input from the following sources:

  • Jay Alammar's post
  • Attention is all you need - paper walk through by Yannic Kilcher
  • RASA - Attention paper walk through - 4 Videos
  • Attention paper walk through by Code Emporium
  • LSTM is Dead - Long live Transformers Meet up talk
  • ELMO + GPT2 + Transforms - How NLP cracked transfer learning - Jay Alammar

My immediate next step is to work through Aladdin’s PyTorch videos on seq2seq and attention, along with the accompanying PyTorch code. Hopefully, by understanding that code, I will be able to get a good grasp of the Transformer architecture.