What did I learn from CodeEmporium about BERT and Transformers

BERT Neural Network Explained

  • LSTMs are slow (because of sequential processing) and not truly bidirectional

  • Transformers are fast and truly bidirectional

    • Enable fast processing via parallel architectures
    • No recurrent units (unlike LSTMs)
    • Self-Attention mechanism
  • Stack up Encoders - BERT

  • Stack up Decoders - GPT

  • BERT has two components in its training phase

    • Next Sentence Prediction
    • Masked Language Model
  • Input

    • Word vector representations
    • Positional Encoding
    • Sentence index (segment embedding distinguishing Sentence 1 from Sentence 2)
  • Input Format - [CLS] + [Sentence 1] + [SEP] + [Sentence 2]

  • Output Format

    [CLS] \( T_1, T_2,\ldots, T_N \) [SEP] \( T_1^\prime, T_2^\prime,\ldots, T_M^\prime \)
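To make the input and output formats concrete, here is a minimal sketch, assuming the Hugging Face transformers package and the bert-base-uncased checkpoint; it shows the [CLS]/[SEP] packing, the sentence (segment) indices, and one output vector per token:

```python
# Minimal sketch, assuming the Hugging Face `transformers` package and the
# `bert-base-uncased` checkpoint (downloads weights on first use).
from transformers import BertTokenizer, BertModel
import torch

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased")

# Two sentences are packed as: [CLS] Sentence 1 [SEP] Sentence 2 [SEP]
encoded = tokenizer("The man went to the store.",
                    "He bought a gallon of milk.",
                    return_tensors="pt")

print(tokenizer.convert_ids_to_tokens(encoded["input_ids"][0].tolist()))
print(encoded["token_type_ids"])  # sentence index: 0 for Sentence 1, 1 for Sentence 2

with torch.no_grad():
    outputs = model(**encoded)

# One output vector per input token: [CLS], T_1..T_N, [SEP], T'_1..T'_M, [SEP]
print(outputs.last_hidden_state.shape)  # (1, num_tokens, 768)
```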

NLP with Neural Networks and Transformers

  • Embeddings from Language Models
  • ELMo uses two layers of bidirectional LSTMs
  • OpenAI GPT uses a stack of decoders and is very fast
  • BERT is bidirectional encoding with transformer architecture
    • Fast
    • Feed in the entire sentence at once
    • Learns context from both directions simultaneously
  • XLNet is another improvement on BERT
  • spaCy has functions (e.g., the spacy-transformers integration) that can be used with BERT and HuggingFace
  • HuggingFace provides the Transformers library with many pretrained models (see the sketch below)
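As a quick illustration of the masked-language-model side of BERT, a minimal sketch assuming the Hugging Face transformers package and its fill-mask pipeline; BERT predicts the hidden token from context on both sides:

```python
# Minimal sketch, assuming the Hugging Face `transformers` package;
# the fill-mask pipeline runs BERT's masked-language-model head.
from transformers import pipeline

fill_mask = pipeline("fill-mask", model="bert-base-uncased")

for prediction in fill_mask("The capital of France is [MASK]."):
    print(prediction["token_str"], round(prediction["score"], 3))
```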

Transformer Neural Networks - EXPLAINED! (Attention is all you need)

  • RNN
    • Many to Many Models
    • Many to One Models
    • One to Many Models
  • Transformer components
    • Encoder: Input Embedding + Positional Encoding + Multi-Headed Attention Layer + Feed-Forward Layer
    • Decoder: Output Embedding + Positional Encoding + Multi-Headed Attention Layer + Encoder-Decoder Attention + Feed-Forward Layer
  • Queries (Q), keys (K), and values (V) are combined to compute attention (see the positional-encoding and attention sketches after this list)
  • Layer Normalization
  • Transformer code in TensorFlow is available to play with
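A minimal sketch of the sinusoidal positional encoding used in the encoder and decoder components above, written in plain NumPy; the function name and sizes are illustrative assumptions:

```python
# Minimal sketch of sinusoidal positional encoding ("Attention Is All You Need");
# names and sizes are illustrative, not from a specific library.
import numpy as np

def sinusoidal_positional_encoding(seq_len, d_model):
    positions = np.arange(seq_len)[:, None]                        # (seq_len, 1)
    dims = np.arange(d_model)[None, :]                             # (1, d_model)
    angle_rates = 1.0 / np.power(10000.0, (2 * (dims // 2)) / d_model)
    angles = positions * angle_rates                               # (seq_len, d_model)
    encoding = np.zeros((seq_len, d_model))
    encoding[:, 0::2] = np.sin(angles[:, 0::2])                    # even dims use sine
    encoding[:, 1::2] = np.cos(angles[:, 1::2])                    # odd dims use cosine
    return encoding

print(sinusoidal_positional_encoding(seq_len=10, d_model=16).shape)  # (10, 16)
```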
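And a minimal sketch of scaled dot-product attention computed from Q, K, V, again in plain NumPy rather than any particular framework:

```python
# Minimal sketch of scaled dot-product attention:
# Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                   # query-key similarities
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)    # row-wise softmax
    return weights @ V                                # weighted sum of values

# Toy example: 4 tokens, model dimension 8
rng = np.random.default_rng(0)
Q, K, V = (rng.normal(size=(4, 8)) for _ in range(3))
print(scaled_dot_product_attention(Q, K, V).shape)    # (4, 8)
```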