What did I learn from the Transformer primer by Jay Alammar

  • The basic idea of the Transformer is that it combines multi-headed attention with positional encoding
  • If one pops open an encoder layer in the encoder, it contains the following parts (see the encoder-layer sketch after this list):
    • Multi-headed attention layers
    • Layer Normalization
    • Residual Connections
    • Fully Connected layers
  • If one pops open a decoder layer in the decoder, it contains the following parts (see the decoder-layer sketch after this list):
    • Multi-headed attention
    • Encoder-Decoder attention
    • Residual Connection
    • Layer Normalization
    • Fully Connected Layers
  • \( Z = \mathrm{softmax}\!\left(\frac{Q K^{T}}{\sqrt{d_k}}\right) V \) gives self-attention in matrix form (a small sketch is included below)
  • Positional encoding using sines and cosines (sketched in code below)
  • Beam search: I had learnt this long ago in Andrew Ng’s course and relearnt it from the nice explanation in this blog post (a toy sketch is included below)
  • Visualization of positional encoding using sines and cosines
  • Need to explore the PyTorch implementation of Transformers (a first look at torch.nn.Transformer is sketched below)
  • Visuals that capture the encoding and decoding phases
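
A minimal sketch of one encoder layer, just to make the list above concrete. I am assuming post-norm ordering (residual connection, then LayerNorm) and illustrative sizes (d_model=512, n_heads=8, d_ff=2048); none of this is taken verbatim from the blog post.

```python
import torch
import torch.nn as nn

class EncoderLayer(nn.Module):
    def __init__(self, d_model=512, n_heads=8, d_ff=2048, dropout=0.1):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(d_model, n_heads,
                                               dropout=dropout, batch_first=True)
        # Position-wise fully connected layers
        self.ffn = nn.Sequential(
            nn.Linear(d_model, d_ff),
            nn.ReLU(),
            nn.Linear(d_ff, d_model),
        )
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x):
        # Multi-headed self-attention with a residual connection + layer norm
        attn_out, _ = self.self_attn(x, x, x)
        x = self.norm1(x + self.dropout(attn_out))
        # Fully connected block, again with residual connection + layer norm
        x = self.norm2(x + self.dropout(self.ffn(x)))
        return x

x = torch.randn(2, 10, 512)        # (batch, sequence, d_model)
print(EncoderLayer()(x).shape)     # torch.Size([2, 10, 512])
```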
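
The corresponding decoder-layer sketch adds masked self-attention and the encoder-decoder attention over the encoder output ("memory"). Again, post-norm ordering and the sizes are my assumptions, not something prescribed by the post.

```python
import torch
import torch.nn as nn

class DecoderLayer(nn.Module):
    def __init__(self, d_model=512, n_heads=8, d_ff=2048, dropout=0.1):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(d_model, n_heads,
                                               dropout=dropout, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads,
                                                dropout=dropout, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(d_model, d_ff),
                                 nn.ReLU(),
                                 nn.Linear(d_ff, d_model))
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.norm3 = nn.LayerNorm(d_model)
        self.dropout = nn.Dropout(dropout)

    def forward(self, y, memory, causal_mask):
        # Masked multi-headed self-attention over the target sequence
        out, _ = self.self_attn(y, y, y, attn_mask=causal_mask)
        y = self.norm1(y + self.dropout(out))
        # Encoder-decoder attention: queries from the decoder, keys/values from the encoder
        out, _ = self.cross_attn(y, memory, memory)
        y = self.norm2(y + self.dropout(out))
        # Fully connected layers with residual connection + layer norm
        y = self.norm3(y + self.dropout(self.ffn(y)))
        return y

y = torch.randn(2, 7, 512)                                     # decoder input so far
memory = torch.randn(2, 10, 512)                               # encoder output
mask = torch.triu(torch.ones(7, 7, dtype=torch.bool), diagonal=1)  # hide future positions
print(DecoderLayer()(y, memory, mask).shape)                   # torch.Size([2, 7, 512])
```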
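
The attention formula above, written out as a small function. It assumes Q, K and V have already been produced by the learned projection matrices.

```python
import torch

def self_attention(Q, K, V):
    d_k = Q.size(-1)
    scores = Q @ K.transpose(-2, -1) / d_k ** 0.5   # similarity scores, scaled by sqrt(d_k)
    weights = torch.softmax(scores, dim=-1)         # each row sums to 1
    return weights @ V                              # Z: weighted sum of the value vectors

Q, K, V = torch.randn(5, 64), torch.randn(5, 64), torch.randn(5, 64)
print(self_attention(Q, K, V).shape)                # torch.Size([5, 64])
```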
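
A small sketch of the sinusoidal positional encoding, plus the heatmap that gives the striped visualization from the post. The 10000 base comes from the original paper; the plot styling is my own choice.

```python
import numpy as np
import matplotlib.pyplot as plt

def positional_encoding(max_len, d_model):
    pos = np.arange(max_len)[:, None]               # (max_len, 1) positions
    i = np.arange(d_model // 2)[None, :]            # (1, d_model/2) dimension indices
    angles = pos / 10000 ** (2 * i / d_model)
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angles)                    # even dimensions use sine
    pe[:, 1::2] = np.cos(angles)                    # odd dimensions use cosine
    return pe

pe = positional_encoding(100, 128)
plt.imshow(pe, cmap="RdBu", aspect="auto")
plt.xlabel("embedding dimension")
plt.ylabel("position")
plt.colorbar()
plt.show()
```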
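
A toy beam-search sketch, kept deliberately abstract: step_fn is a hypothetical stand-in for whatever model returns (token, log-probability) pairs for the next step given a prefix.

```python
import math

def beam_search(step_fn, start_token, end_token, beam_width=3, max_len=20):
    # Each beam is (token sequence, cumulative log-probability)
    beams = [([start_token], 0.0)]
    for _ in range(max_len):
        candidates = []
        for tokens, score in beams:
            if tokens[-1] == end_token:            # finished beams carry over unchanged
                candidates.append((tokens, score))
                continue
            for tok, logp in step_fn(tokens):      # expand with each candidate next token
                candidates.append((tokens + [tok], score + logp))
        # Keep only the beam_width highest-scoring hypotheses
        beams = sorted(candidates, key=lambda b: b[1], reverse=True)[:beam_width]
    return beams[0]

# Tiny fake "model": always prefers token 1, occasionally ends with token 9.
def step_fn(tokens):
    return [(1, math.log(0.6)), (2, math.log(0.3)), (9, math.log(0.1))]

print(beam_search(step_fn, start_token=0, end_token=9))
```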
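
For the PyTorch exploration, a first look at the built-in torch.nn.Transformer module; the hyperparameters here are just the documented defaults, and the random tensors stand in for embedded (and positionally encoded) sequences.

```python
import torch
import torch.nn as nn

model = nn.Transformer(d_model=512, nhead=8,
                       num_encoder_layers=6, num_decoder_layers=6,
                       batch_first=True)

src = torch.randn(2, 10, 512)    # source sequence embeddings + positional encoding
tgt = torch.randn(2, 7, 512)     # target sequence embeddings + positional encoding
tgt_mask = model.generate_square_subsequent_mask(7)   # causal mask for the decoder

out = model(src, tgt, tgt_mask=tgt_mask)
print(out.shape)                 # torch.Size([2, 7, 512])
```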