What did I learn from the Transformer primer by Jay Alammar

  • The basic idea of the Transformer is that it combines multi-headed attention with positional encoding
  • If one pops open an encoder layer in the encoder, it contains the following parts (see the encoder-layer sketch after this list):
    • Multi-headed attention layers
    • Layer Normalization
    • Residual Connections
    • Fully Connected layers
  • If one pops open a decoder layer in the decoder, it contains the following parts (see the decoder-layer sketch after this list):
    • Multi-headed attention
    • Encoder-Decoder attention
    • Residual Connection
    • Layer Normalization
    • Fully Connected Layers
  • \( Z = \mathrm{softmax}\!\left(\frac{Q K^{T}}{\sqrt{d_k}}\right) V \) gives self-attention in matrix form (a small sketch is included below)
  • Positional encoding using sines and cosines (sketched in code below)
  • Beam search: I had learnt this long ago in Andrew Ng’s course and relearnt it from the nice explanation in this blog post (a toy sketch is included below)
  • Visualization of positional encoding using sines and cosines
  • Need to explore the PyTorch implementation of Transformers (a first look at torch.nn.Transformer is sketched below)
  • Visuals that capture the encoding and decoding phases
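
A minimal sketch of one encoder layer, just to make the list above concrete. I am assuming post-norm ordering (residual connection, then LayerNorm) and illustrative sizes (d_model=512, n_heads=8, d_ff=2048); none of this is taken verbatim from the blog post.

```python
import torch
import torch.nn as nn

class EncoderLayer(nn.Module):
    def __init__(self, d_model=512, n_heads=8, d_ff=2048, dropout=0.1):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(d_model, n_heads,
                                               dropout=dropout, batch_first=True)
        # Position-wise fully connected layers
        self.ffn = nn.Sequential(
            nn.Linear(d_model, d_ff),
            nn.ReLU(),
            nn.Linear(d_ff, d_model),
        )
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x):
        # Multi-headed self-attention with a residual connection + layer norm
        attn_out, _ = self.self_attn(x, x, x)
        x = self.norm1(x + self.dropout(attn_out))
        # Fully connected block, again with residual connection + layer norm
        x = self.norm2(x + self.dropout(self.ffn(x)))
        return x

x = torch.randn(2, 10, 512)        # (batch, sequence, d_model)
print(EncoderLayer()(x).shape)     # torch.Size([2, 10, 512])
```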
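
The corresponding decoder-layer sketch adds masked self-attention and the encoder-decoder attention over the encoder output ("memory"). Again, post-norm ordering and the sizes are my assumptions, not something prescribed by the post.

```python
import torch
import torch.nn as nn

class DecoderLayer(nn.Module):
    def __init__(self, d_model=512, n_heads=8, d_ff=2048, dropout=0.1):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(d_model, n_heads,
                                               dropout=dropout, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads,
                                                dropout=dropout, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(d_model, d_ff),
                                 nn.ReLU(),
                                 nn.Linear(d_ff, d_model))
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.norm3 = nn.LayerNorm(d_model)
        self.dropout = nn.Dropout(dropout)

    def forward(self, y, memory, causal_mask):
        # Masked multi-headed self-attention over the target sequence
        out, _ = self.self_attn(y, y, y, attn_mask=causal_mask)
        y = self.norm1(y + self.dropout(out))
        # Encoder-decoder attention: queries from the decoder, keys/values from the encoder
        out, _ = self.cross_attn(y, memory, memory)
        y = self.norm2(y + self.dropout(out))
        # Fully connected layers with residual connection + layer norm
        y = self.norm3(y + self.dropout(self.ffn(y)))
        return y

y = torch.randn(2, 7, 512)                                     # decoder input so far
memory = torch.randn(2, 10, 512)                               # encoder output
mask = torch.triu(torch.ones(7, 7, dtype=torch.bool), diagonal=1)  # hide future positions
print(DecoderLayer()(y, memory, mask).shape)                   # torch.Size([2, 7, 512])
```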
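
The attention formula above, written out as a small function. It assumes Q, K and V have already been produced by the learned projection matrices.

```python
import torch

def self_attention(Q, K, V):
    d_k = Q.size(-1)
    scores = Q @ K.transpose(-2, -1) / d_k ** 0.5   # similarity scores, scaled by sqrt(d_k)
    weights = torch.softmax(scores, dim=-1)         # each row sums to 1
    return weights @ V                              # Z: weighted sum of the value vectors

Q, K, V = torch.randn(5, 64), torch.randn(5, 64), torch.randn(5, 64)
print(self_attention(Q, K, V).shape)                # torch.Size([5, 64])
```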
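
A small sketch of the sinusoidal positional encoding, plus the heatmap that gives the striped visualization from the post. The 10000 base comes from the original paper; the plot styling is my own choice.

```python
import numpy as np
import matplotlib.pyplot as plt

def positional_encoding(max_len, d_model):
    pos = np.arange(max_len)[:, None]               # (max_len, 1) positions
    i = np.arange(d_model // 2)[None, :]            # (1, d_model/2) dimension indices
    angles = pos / 10000 ** (2 * i / d_model)
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angles)                    # even dimensions use sine
    pe[:, 1::2] = np.cos(angles)                    # odd dimensions use cosine
    return pe

pe = positional_encoding(100, 128)
plt.imshow(pe, cmap="RdBu", aspect="auto")
plt.xlabel("embedding dimension")
plt.ylabel("position")
plt.colorbar()
plt.show()
```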
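
A toy beam-search sketch, kept deliberately abstract: step_fn is a hypothetical stand-in for whatever model returns (token, log-probability) pairs for the next step given a prefix.

```python
import math

def beam_search(step_fn, start_token, end_token, beam_width=3, max_len=20):
    # Each beam is (token sequence, cumulative log-probability)
    beams = [([start_token], 0.0)]
    for _ in range(max_len):
        candidates = []
        for tokens, score in beams:
            if tokens[-1] == end_token:            # finished beams carry over unchanged
                candidates.append((tokens, score))
                continue
            for tok, logp in step_fn(tokens):      # expand with each candidate next token
                candidates.append((tokens + [tok], score + logp))
        # Keep only the beam_width highest-scoring hypotheses
        beams = sorted(candidates, key=lambda b: b[1], reverse=True)[:beam_width]
    return beams[0]

# Tiny fake "model": always prefers token 1, occasionally ends with token 9.
def step_fn(tokens):
    return [(1, math.log(0.6)), (2, math.log(0.3)), (9, math.log(0.1))]

print(beam_search(step_fn, start_token=0, end_token=9))
```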
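
For the PyTorch exploration, a first look at the built-in torch.nn.Transformer module; the hyperparameters here are just the documented defaults, and the random tensors stand in for embedded (and positionally encoded) sequences.

```python
import torch
import torch.nn as nn

model = nn.Transformer(d_model=512, nhead=8,
                       num_encoder_layers=6, num_decoder_layers=6,
                       batch_first=True)

src = torch.randn(2, 10, 512)    # source sequence embeddings + positional encoding
tgt = torch.randn(2, 7, 512)     # target sequence embeddings + positional encoding
tgt_mask = model.generate_square_subsequent_mask(7)   # causal mask for the decoder

out = model(src, tgt, tgt_mask=tgt_mask)
print(out.shape)                 # torch.Size([2, 7, 512])
```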