What did I learn from CodeEmporium about BERT and Transformers

BERT Neural Network Explained

  • LSTMs are slow (because of sequential processing) and not truly bidirectional

  • Transformers are fast and truly bidirectional

    • Enable fast processing via parallel architectures
    • No recurrent units (unlike LSTMs)
    • Self-Attention mechanism
  • Stack up Encoders - BERT

  • Stack up Decoders - GPT

  • BERT has two components in its training phase

    • Next Sentence Prediction
    • Masked Language Model
  • Input

    • Word vector representations
    • Positional Encoding
    • Sentence index (segment embedding distinguishing Sentence 1 from Sentence 2)
  • Input Format - [CLS] + [Sentence 1] + [SEP] + [Sentence 2]

  • Output Format

    [CLS] \( T_1, T_2,\ldots, T_N \) [SEP] \( T_1^\prime, T_2^\prime,\ldots, T_M^\prime \)
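To make the input and output formats concrete, here is a minimal sketch, assuming the Hugging Face transformers package and the bert-base-uncased checkpoint; it shows the [CLS]/[SEP] packing, the sentence (segment) indices, and one output vector per token:

```python
# Minimal sketch, assuming the Hugging Face `transformers` package and the
# `bert-base-uncased` checkpoint (downloads weights on first use).
from transformers import BertTokenizer, BertModel
import torch

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased")

# Two sentences are packed as: [CLS] Sentence 1 [SEP] Sentence 2 [SEP]
encoded = tokenizer("The man went to the store.",
                    "He bought a gallon of milk.",
                    return_tensors="pt")

print(tokenizer.convert_ids_to_tokens(encoded["input_ids"][0].tolist()))
print(encoded["token_type_ids"])  # sentence index: 0 for Sentence 1, 1 for Sentence 2

with torch.no_grad():
    outputs = model(**encoded)

# One output vector per input token: [CLS], T_1..T_N, [SEP], T'_1..T'_M, [SEP]
print(outputs.last_hidden_state.shape)  # (1, num_tokens, 768)
```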

NLP with Neural Networks and Transformers

  • Embeddings from Language Models
  • ELMo uses two layers of bidirectional LSTMs
  • OpenAI GPT uses a stack of decoders and is very fast
  • BERT is bidirectional encoding with transformer architecture
    • Fast
    • Feed in the entire sentence at once
    • Learns context from both directions simultaneously
  • XLNet is another improvement on BERT
  • spaCy has functions (e.g., the spacy-transformers integration) that can be used with BERT and HuggingFace
  • HuggingFace provides the Transformers library with many pretrained models (see the sketch below)
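As a quick illustration of the masked-language-model side of BERT, a minimal sketch assuming the Hugging Face transformers package and its fill-mask pipeline; BERT predicts the hidden token from context on both sides:

```python
# Minimal sketch, assuming the Hugging Face `transformers` package;
# the fill-mask pipeline runs BERT's masked-language-model head.
from transformers import pipeline

fill_mask = pipeline("fill-mask", model="bert-base-uncased")

for prediction in fill_mask("The capital of France is [MASK]."):
    print(prediction["token_str"], round(prediction["score"], 3))
```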

Transformer Neural Networks - EXPLAINED! (Attention is all you need)

  • RNN
    • Many to Many Models
    • Many to One Models
    • One to Many Models
  • Transformer components
    • Encoder: Input Embedding + Positional Encoding + Multi-Headed Attention Layer + Feed-Forward Layer
    • Decoder: Output Embedding + Positional Encoding + Multi-Headed Attention Layer + Encoder-Decoder Attention + Feed-Forward Layer
  • Queries (Q), keys (K), and values (V) are combined to compute attention (see the positional-encoding and attention sketches after this list)
  • Layer Normalization
  • Transformer code in TensorFlow is available to play with
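A minimal sketch of the sinusoidal positional encoding used in the encoder and decoder components above, written in plain NumPy; the function name and sizes are illustrative assumptions:

```python
# Minimal sketch of sinusoidal positional encoding ("Attention Is All You Need");
# names and sizes are illustrative, not from a specific library.
import numpy as np

def sinusoidal_positional_encoding(seq_len, d_model):
    positions = np.arange(seq_len)[:, None]                        # (seq_len, 1)
    dims = np.arange(d_model)[None, :]                             # (1, d_model)
    angle_rates = 1.0 / np.power(10000.0, (2 * (dims // 2)) / d_model)
    angles = positions * angle_rates                               # (seq_len, d_model)
    encoding = np.zeros((seq_len, d_model))
    encoding[:, 0::2] = np.sin(angles[:, 0::2])                    # even dims use sine
    encoding[:, 1::2] = np.cos(angles[:, 1::2])                    # odd dims use cosine
    return encoding

print(sinusoidal_positional_encoding(seq_len=10, d_model=16).shape)  # (10, 16)
```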
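And a minimal sketch of scaled dot-product attention computed from Q, K, V, again in plain NumPy rather than any particular framework:

```python
# Minimal sketch of scaled dot-product attention:
# Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                   # query-key similarities
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)    # row-wise softmax
    return weights @ V                                # weighted sum of values

# Toy example: 4 tokens, model dimension 8
rng = np.random.default_rng(0)
Q, K, V = (rng.normal(size=(4, 8)) for _ in range(3))
print(scaled_dot_product_attention(Q, K, V).shape)    # (4, 8)
```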