What did I learn from the attention primer by Jay Alammar

  • Seq2Seq Models have an encoder RNN and a decoder RNN
  • The encoder RNN takes the input one word at a time. It starts with an initial hidden state at t0, and this hidden state is updated as each new word of the sentence is fed in
  • The final hidden state of the encoder is then sent to the decoder. The decoder has its own hidden state
  • The encoder's final hidden state, the decoder's own hidden state, and the decoder's starting word together are used to generate the output words from the decoder
  • The attention mechanism differs from the basic Seq2Seq model: it keeps ALL of the encoder's hidden states from ALL time steps and, at each decoding step, computes a weighted average of them. The decoder uses this weighted average of the encoder hidden states (the context vector), together with its own hidden state, to generate an output vector (see the sketch after this list)
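
To make the two ideas concrete, here is a minimal NumPy sketch (toy shapes, randomly initialized weights, and dot-product scoring as one common choice; this is an illustration under those assumptions, not code from the primer): the encoder rolls its hidden state forward one word at a time, and attention then forms a weighted average of all of those hidden states for the decoder.

```python
import numpy as np

rng = np.random.default_rng(0)
hidden_size, embed_size, seq_len = 8, 6, 4

def softmax(x):
    """Numerically stable softmax."""
    e = np.exp(x - np.max(x))
    return e / e.sum()

# --- Encoder: a plain tanh RNN, fed one word embedding at a time ---------
W_xh = rng.normal(scale=0.1, size=(hidden_size, embed_size))
W_hh = rng.normal(scale=0.1, size=(hidden_size, hidden_size))
word_embeddings = rng.normal(size=(seq_len, embed_size))  # toy input sentence

h = np.zeros(hidden_size)            # initial hidden state at t0
encoder_states = []
for x_t in word_embeddings:          # hidden state updated per input word
    h = np.tanh(W_xh @ x_t + W_hh @ h)
    encoder_states.append(h)
encoder_states = np.stack(encoder_states)    # (seq_len, hidden_size)

# --- Attention: weighted average of ALL encoder hidden states ------------
decoder_hidden = rng.normal(size=hidden_size)         # decoder's own hidden state
scores = encoder_states @ decoder_hidden              # dot-product scores, (seq_len,)
attention_weights = softmax(scores)                   # weights sum to 1
context_vector = attention_weights @ encoder_states   # (hidden_size,)

print(attention_weights.round(3), context_vector.shape)
```

In the basic Seq2Seq model the decoder would see only the last row of `encoder_states`; with attention it sees the whole matrix, weighted per decoding step.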

Overview of Neural Machine Translation with Attention