Attention is all you need

The following are my learnings from the paper titled "Attention Is All You Need":

Using RNNs for language modeling has been particularly painful, as they take a long time to train and, because they process a sequence step by step, have trouble learning representational encodings all at once
In the Transformer architecture, the number of operations required to relate signals from two arbitrary input or output positions is constant
Self-attention is an attention mechanism relating different positions of a single sequence in order to compute a representation of the sequence
The Transformer is the first transduction model relying entirely on self-attention to compute representations of its input and output without using sequence-aligned RNNs or convolution
Learnt about the relationship between Induction, Deduction and Transduction:
- Induction derives a function from the given data, i.e. creates an approximating function
- Deduction derives the values of a given function for points of interest
- Transduction derives the values of an unknown function for points of interest directly from the data

An example of a transduction algorithm is the k-nearest neighbors algorithm
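To make the transduction idea concrete, here is a minimal sketch of k-nearest neighbors in plain Python. The function name `knn_predict` and the toy data are my own illustrative assumptions. The point is that no explicit function is fitted: the label for a point of interest is derived directly from the stored data.

```python
import math
from collections import Counter

def knn_predict(train_points, train_labels, query, k=3):
    """Label `query` by majority vote among its k nearest training points."""
    # Euclidean distance from the query to every stored training point
    distances = [math.dist(p, query) for p in train_points]
    # Indices of the k closest training points
    nearest = sorted(range(len(train_points)), key=lambda i: distances[i])[:k]
    votes = Counter(train_labels[i] for i in nearest)
    return votes.most_common(1)[0][0]

# Toy data (hypothetical): two small clusters labelled "a" and "b"
points = [(0.0, 0.0), (0.1, 0.2), (0.2, 0.1), (1.0, 1.0), (0.9, 1.1), (1.1, 0.9)]
labels = ["a", "a", "a", "b", "b", "b"]

print(knn_predict(points, labels, (0.15, 0.15)))
print(knn_predict(points, labels, (0.95, 1.0)))
```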
A transducer in the context of NLP is defined as a model that outputs one time step for each input time step provided
Many natural language processing (NLP) tasks can be viewed as transduction problems, that is, learning to convert one string into another. Machine translation is a prototypical example of transduction, and recent results indicate that deep RNNs have the ability to encode long source strings and produce coherent translations
Model Details
- Encoder and Decoder
- Input is a positional encoding + Word Embedding
- Each encoder layer comprises a Multi-Head Attention sub-layer and a Feed Forward network, each wrapped in a Residual connection followed by Layer Normalization
- Each decoder layer additionally has a Multi-Head Attention sub-layer that takes Keys and Values from the Encoder and Queries from the Decoder
- The encoder is a stack of 6 identical layers
- The decoder is a stack of 6 identical layers
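The residual connection and layer normalization around each sub-layer can be sketched for a single position vector as follows. In the paper, the output of each sub-layer is LayerNorm(x + Sublayer(x)). This is a simplified sketch: the names `layer_norm` and `add_and_norm` are my own, the learned gain and bias of layer normalization are left out, and the toy `sublayer` is a stand-in for attention or the feed forward network.

```python
import math

def layer_norm(x, eps=1e-6):
    """Normalize one vector to zero mean and unit variance (no learned gain/bias)."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]

def add_and_norm(x, sublayer):
    """Compute LayerNorm(x + Sublayer(x)) for one position vector."""
    out = sublayer(x)
    residual_sum = [a + b for a, b in zip(x, out)]  # residual connection
    return layer_norm(residual_sum)

# Toy sub-layer (hypothetical): scales every feature by 0.5
x = [1.0, 2.0, 3.0, 4.0]
y = add_and_norm(x, lambda v: [0.5 * t for t in v])
print(y)
```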
An attention function can be described as a mapping from a query and a set of key-value pairs to an output, where the query, keys, values and output are all vectors
Scaled Dot product attention

\begin{align} \text{Attention}(Q,K,V) & = \text{softmax} \left( {QK^T \over \sqrt{d_k}} \right) V \end{align}
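The formula above can be sketched in plain Python, with matrices stored as lists of rows. This is a dependency-free illustration rather than an efficient implementation, and the function names and toy matrices are my own assumptions.

```python
import math

def softmax(row):
    """Numerically stable softmax over one list of scores."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def scaled_dot_product_attention(Q, K, V):
    """Q: n_q x d_k, K: n_k x d_k, V: n_k x d_v. Returns (output, weights)."""
    d_k = len(K[0])
    weights = []
    for q in Q:
        # Dot product of this query with every key, scaled by sqrt(d_k)
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k) for k in K]
        weights.append(softmax(scores))
    # Each output row is a weighted average of the rows of V
    output = [
        [sum(w * v[j] for w, v in zip(w_row, V)) for j in range(len(V[0]))]
        for w_row in weights
    ]
    return output, weights

# Toy inputs (hypothetical): 2 queries, 3 key-value pairs, d_k = d_v = 2
Q = [[1.0, 0.0], [0.0, 1.0]]
K = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
V = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
output, weights = scaled_dot_product_attention(Q, K, V)
print(weights[0])  # the first query attends most to keys 0 and 2
```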

MultiHead attention

\begin{align} \text{MultiHeadAttention}(Q,K,V) & = \text{Concat} (\text{head}_1, \text{head}_2, \dots, \text{head}_h) W^O \end{align}

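where \(\text{head}_i \) is the output of the \(i\)-th attention head, computed in the paper as \(\text{Attention}(QW_i^Q, KW_i^K, VW_i^V)\) with a separate learned projection of the queries, keys and values for each head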
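A minimal sketch of multi-head attention in plain Python, assuming the per-head projections from the paper, where each head is Attention(QW_i^Q, KW_i^K, VW_i^V). The helper names, the random weights and the tiny sizes (`d_model = 8`, 2 heads) are illustrative assumptions, not trained values.

```python
import math
import random

def matmul(A, B):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def softmax(row):
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def attention(Q, K, V):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d_k)) V."""
    d_k = len(K[0])
    K_T = [list(col) for col in zip(*K)]
    scores = matmul(Q, K_T)
    weights = [softmax([s / math.sqrt(d_k) for s in row]) for row in scores]
    return matmul(weights, V)

def multi_head_attention(Q, K, V, heads, W_o):
    """heads: one (W_q, W_k, W_v) tuple of projection matrices per head."""
    head_outputs = [
        attention(matmul(Q, W_q), matmul(K, W_k), matmul(V, W_v))
        for W_q, W_k, W_v in heads
    ]
    # Concatenate the head outputs along the feature dimension, then project
    concat = [sum((head[i] for head in head_outputs), []) for i in range(len(Q))]
    return matmul(concat, W_o)

def random_matrix(rows, cols, rng):
    return [[rng.gauss(0, 1) for _ in range(cols)] for _ in range(rows)]

rng = random.Random(0)
d_model, num_heads = 8, 2
d_k = d_model // num_heads
X = random_matrix(3, d_model, rng)  # 3 positions
heads = [tuple(random_matrix(d_model, d_k, rng) for _ in range(3)) for _ in range(num_heads)]
W_o = random_matrix(num_heads * d_k, d_model, rng)

out = multi_head_attention(X, X, X, heads, W_o)  # self-attention: Q = K = V = X
print(len(out), len(out[0]))  # 3 8
```

Passing the same matrix as queries, keys and values is what makes this self-attention. In the decoder's encoder-decoder attention, the keys and values would come from the encoder output instead.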

Encoder contains self-attention layers.
Decoder contains self-attention layers
Positional encoding is done via sine and cosine functions
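As a sketch of the sinusoidal encoding, the paper uses PE(pos, 2i) = sin(pos / 10000^(2i/d_model)) and PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model)). The function name and the small sizes below are my own choices for illustration.

```python
import math

def positional_encoding(max_len, d_model):
    """Return a max_len x d_model table: even columns use sine, odd columns cosine."""
    table = []
    for pos in range(max_len):
        row = []
        for j in range(d_model):
            # Columns 2i and 2i+1 share the same frequency, so round j down to even
            angle = pos / (10000 ** ((j - j % 2) / d_model))
            row.append(math.sin(angle) if j % 2 == 0 else math.cos(angle))
        table.append(row)
    return table

pe = positional_encoding(max_len=4, d_model=6)
print(pe[0])  # position 0: every sine is 0.0 and every cosine is 1.0
```

Each row of this table is added to the word embedding at the same position to form the model input.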
Why do the authors use self-attention?
- Faster to train than RNN
- Total computational complexity per layer is reduced when the sequence length is smaller than the representation dimension
- Path length between input and output positions is shorter compared to RNN
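The complexity point can be checked with rough arithmetic. The paper gives the per-layer cost as O(n^2 · d) for self-attention and O(n · d^2) for a recurrent layer, where n is the sequence length and d the representation dimension. The sketch below ignores constant factors, and the sentence length of 50 is an assumed example value (d = 512 matches the base model).

```python
def per_layer_cost(n, d):
    """Order-of-magnitude operation counts per layer, constants ignored."""
    return {
        "self_attention": n * n * d,  # every position attends to every position
        "recurrent": n * d * d,       # one d x d state update per position
    }

costs = per_layer_cost(n=50, d=512)
print(costs)  # {'self_attention': 1280000, 'recurrent': 13107200}
```

With n smaller than d, the self-attention count is lower, and the whole layer can be computed in parallel instead of in n sequential steps.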
Training
- Performed on the WMT 2014 English-German dataset, which contains about 4.5 million sentence pairs
- Each training batch - approximately 25000 source tokens and 25000 target tokens
- Base model training time is 12 hours
- Big Model training time is 3.5 days
- 8 NVIDIA P100 GPUs needed
- Adam optimizer used
- Three types of regularization done - Residual dropout on the output of each sub-layer, dropout on the sum of embeddings and positional encodings in both the encoder and decoder stacks, and label smoothing
Results
- English to German - BLEU score of 28.4
- English to French - BLEU score of 41

Finally, after many weeks, I sat down and read through the entire paper. Of course, this would not have been possible without the input from the following sources:

Jay Alammar's post
Attention is all you need - paper walk through by Yannic Kilcher
RASA - Attention paper walk through - 4 Videos
Attention paper walk through by Code Emporium
LSTM is Dead - Long live Transformers Meet up talk
ELMO + GPT2 + Transforms - How NLP cracked transfer learning - Jay Alammar

My immediate next step is to work through Aladdin's PyTorch videos on Seq2Seq and attention code. Hopefully, by understanding the code, I will be able to get a good grasp of the Transformer architecture.
