It has been almost a year since I last used the transformers library in my work. I needed it again recently for a different project, and on revisiting the library I found that the current version from Hugging Face has many new features. The library has become powerful and is probably the best place to go for using various pre-trained models. This blog post lists useful points mentioned by Hugging Face on their website.

Pipeline

The pipeline function returns an end-to-end object that performs an NLP task on one or several texts. It pre-processes the text, applies the relevant model and then post-processes the model output. Some of the use cases of the pipeline function are listed below.

Sample use cases

Sentiment Analysis

from transformers import pipeline

classifier = pipeline("sentiment-analysis")
classifier("I love emacs")
label : POSITIVE score : 0.9997842907905579
classifier = pipeline("sentiment-analysis")
classifier(["I love playing with my daughter", "I dont like wasting time"])
label : POSITIVE score : 0.9996106624603271
label : NEGATIVE score : 0.9852923154830933

Zero-Shot Classification

classifier = pipeline("zero-shot-classification")
classifier("I love emacs", candidate_labels=["programming", "leisure", "family"])
sequence : I love emacs labels : (leisure programming family) scores : (0.4507399797439575 0.43812379240989685 0.11113625019788742)

Text Generation

generator = pipeline("text-generation", model="distilgpt2")
test = generator(
    "I am trying to learn how to swim because ", max_length=60, num_return_sequences=2
)
test[0]["generated_text"].encode("utf-8")
b"I am trying to learn how to swim because \xc2\xa0 i was looking for a\xc2\xa0 good way for my body\xc2\xa0 \xc2\xa0i couldn't get enough of a swim I needed, \xc2\xa0 i knew I could do things better. I was pretty exhausted in class though, so I was\xc2\xa0\xc2\xa0 going"

Mask Filling

unmasker = pipeline("fill-mask")
unmasker("I am trying to learn <mask> for the past one year", top_k=2)
sequence : I am trying to learn English for the past one year score : 0.1867246925830841 token : 2370 token_str : English
sequence : I am trying to learn something for the past one year score : 0.052237771451473236 token : 402 token_str : something

NER

ner = pipeline("ner", grouped_entities=True)
ner("My friend Peter, joined my team in Singapore")
entity_group : PER score : 0.99958867 word : Peter start : 10 end : 15
entity_group : LOC score : 0.9998326 word : Singapore start : 35 end : 44

Question Answering

question_answer = pipeline("question-answering")
result = question_answer(
    question="where am I running ?", context="I was jogging in a park near my house"
)
result
score : 0.3953234553337097 start : 17 end : 37 answer : a park near my house

Summarizer

summarizer = pipeline("summarization")

text = """ Higher stocks and pressure on the U.S dollar reflect optimism in markets, with forward-looking metrics sending encouraging signals, too [nL1N2BV08D], but FX dealers shouldn't be complacent. While falling FX implied volatility suggests a decreasing risk of excessive actual volatility, supply is starting to meet demand at levels still well above pre-crisis lows. That means option players aren't ruling out further bouts of real volatility over coming sessions, albeit less than those experienced at the start of the crisis in early March. FX watchers should also look at risk reversals - they show which currency, in a pair, is deemed the most vulnerable. One-month EUR/USD risk reversals are losing EUR put implied volatility premiums quickly. Combined with falling implied volatility, that suggests paring of downside risk [nL1N2BV066]. However, in pairs like cable, implied volatility setbacks have been milder and the recent GBP put risk reversal premium slower to ease, suggesting the perceived risk of GBP/USD falling, rather than gaining """

summarizer(text)

summary_text : Falling FX implied volatility suggests decreasing risk of excessive actual volatility . But supply is starting to meet demand at levels still well above pre-crisis lows . That means option players aren’t ruling out further bouts of real volatility over coming sessions . FX watchers should also look at risk reversals - they show which currency, in a pair, is deemed the most vulnerable .

What happens in a pipeline?

  • There are three stages in a pipeline
    • Tokenizer: raw text is converted to tokens, special tokens are added, and the tokens are then converted to token IDs
    • Model: the token IDs are passed through the model, which outputs logits or hidden states
    • Post Processing: the model outputs are transformed into human-readable predictions

Loading from AutoTokenizer

The AutoTokenizer class can load the tokenizer for any checkpoint

from transformers import AutoTokenizer

checkpoint = "distilbert-base-uncased-finetuned-sst-2-english"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
raw_inputs = ["Started with Google BERT", "Build and train state of art NLP models"]
inputs = tokenizer(raw_inputs, padding=True, truncation=True, return_tensors="pt")
[list(inputs.keys()), inputs["input_ids"].shape, inputs["attention_mask"].shape]
(input_ids attention_mask) torch.Size ((2 11)) torch.Size ((2 11))

Apply the model via AutoModel

The AutoModel class loads a model without its pretraining head.

from transformers import AutoModel

model = AutoModel.from_pretrained(checkpoint)
outputs = model(**inputs)
outputs.last_hidden_state.shape
torch.Size([2, 11, 768])

Use AutoModelForXxx for a task

Each AutoModelForXxx class loads a model suitable for a specific task

from transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained(checkpoint)
outputs = model(**inputs)
outputs.logits
tensor([[ 1.6011, -1.3941],
        [-3.6381,  3.8054]], grad_fn=<AddmmBackward>)

Post processing

Apply a softmax transformation to the logits to convert them into probabilities.

raw_inputs = ["The stock fell by 5%", "The market index has risen today",  "It is a volatile market today"]
inputs = tokenizer(raw_inputs, padding=True, truncation=True, return_tensors="pt")
model = AutoModelForSequenceClassification.from_pretrained(checkpoint)
outputs = model(**inputs)
(torch.nn.functional.softmax(outputs.logits, dim=-1) * 100).round()
tensor([[100.,   0.],
        [  0., 100.],
        [ 99.,   1.]])    (0: NEGATIVE, 1: POSITIVE)
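To attach labels to these probabilities, the model configuration exposes an id2label mapping. A minimal sketch, reusing the tokenizer, model and inputs from the snippet above:

import torch

probs = torch.nn.functional.softmax(outputs.logits, dim=-1)
for sentence, p in zip(raw_inputs, probs):
    # model.config.id2label maps class indices to label names, e.g. {0: 'NEGATIVE', 1: 'POSITIVE'}
    label_id = int(p.argmax())
    print(f"{sentence!r} -> {model.config.id2label[label_id]} ({p[label_id].item():.3f})")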

Transfer Learning

  • Use the weights of Model A used for Task A in Model B for Task B
  • Training from scratch requires more data and more compute to achieve comparable results
  • In computer vision, pretrained models are often used as the starting point
  • GPT-2 was pretrained on 40 GB of internet text scraped from links shared on Reddit
  • Pretraining usually relies on self-supervised proxy tasks. A few examples are
    • Masked word prediction
    • Next Sentence prediction
    • Next word prediction
  • Transfer learning is applied by dropping the head of the pretrained model while keeping its body
  • A limitation of pretrained models is that they also transfer their biases
  • There are a lot of biases in the GPT-3 model. OpenAI acknowledges the bias and suggests that the model not be used in human interactions

Transformer architecture

  • Transformer is based on the attention mechanism
  • Transformer has two pieces - encoder and decoder
  • Encoder
    • Bi-directional and self-attention mechanism
    • Encoder Only architecture - BERT, RoBERTa, ALBERT
    • Produces a numerical representation for each word
    • The representation of each word is contextualized by the words around it
    • Good at extracting meaningful information
    • Used for sequence classification, question answering, masked language modeling, NLU
  • Decoder
    • Uni-directional
    • Auto regressive
    • Masked Self-attention
    • Great at causal tasks; generating sequences
    • NLG - Natural Language generation, is a use case
    • Examples of decoder-only models - GPT-2, GPT Neo
  • Popular Encoder-Decoder models available via transformers
    • BART
    • ProphetNet
    • mT5
    • M2M100
    • T5
    • Pegasus
    • MarianMT
    • mBART
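Each of these three families maps to a different task-specific Auto class in transformers. A minimal sketch loading one representative checkpoint per family (checkpoints chosen here purely for illustration):

from transformers import AutoModelForMaskedLM, AutoModelForCausalLM, AutoModelForSeq2SeqLM

# Encoder-only: masked language modeling (e.g. BERT)
encoder_model = AutoModelForMaskedLM.from_pretrained("bert-base-uncased")

# Decoder-only: causal language modeling / text generation (e.g. GPT-2)
decoder_model = AutoModelForCausalLM.from_pretrained("gpt2")

# Encoder-decoder: sequence-to-sequence tasks such as summarization or translation (e.g. T5)
seq2seq_model = AutoModelForSeq2SeqLM.from_pretrained("t5-small")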

Auto Model API

  • The AutoModel API allows you to instantiate a pretrained model from any checkpoint
  • It downloads the configuration file and the model weights file
  • The configuration file is parsed to find the model class; based on that class, the relevant model is created
  • The model configuration is read from the config file, and the model architecture is built from it
  • The weights file is then used to load the pretrained weights into the model
  • AutoConfig can be used to instantiate just the configuration, and a model can be created from that config
from transformers import BertConfig

bert_config = BertConfig.from_pretrained('bert-base-uncased')
bert_config
BertConfig {
  "architectures": [
    "BertForMaskedLM"
  ],
  "attention_probs_dropout_prob": 0.1,
  "gradient_checkpointing": false,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768,
  "initializer_range": 0.02,
  "intermediate_size": 3072,
  "layer_norm_eps": 1e-12,
  "max_position_embeddings": 512,
  "model_type": "bert",
  "num_attention_heads": 12,
  "num_hidden_layers": 12,
  "pad_token_id": 0,
  "position_embedding_type": "absolute",
  "transformers_version": "4.8.2",
  "type_vocab_size": 2,
  "use_cache": true,
  "vocab_size": 30522
}
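The same can be done with the generic Auto classes mentioned in the bullet points above; a minimal sketch:

from transformers import AutoConfig, AutoModel

config = AutoConfig.from_pretrained("bert-base-uncased")   # downloads and parses the config file
model = AutoModel.from_config(config)                      # builds the architecture with randomly initialized weights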
  • One can modify the configuration to change the architecture and instantiate a model from it; a model built this way is randomly initialized rather than pretrained
from transformers import BertConfig, BertModel

bert_config = BertConfig.from_pretrained('bert-base-uncased', num_hidden_layers=10)
bert_model = BertModel(bert_config)
bert_model
BertModel(
  (embeddings): BertEmbeddings(
    (word_embeddings): Embedding(30522, 768, padding_idx=0)
    (position_embeddings): Embedding(512, 768)
    (token_type_embeddings): Embedding(2, 768)
    (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
    (dropout): Dropout(p=0.1, inplace=False)
  )
  (encoder): BertEncoder(
    (layer): ModuleList(
      (0): BertLayer(
        (attention): BertAttention(
          (self): BertSelfAttention(
            (query): Linear(in_features=768, out_features=768, bias=True)
            (key): Linear(in_features=768, out_features=768, bias=True)
            (value): Linear(in_features=768, out_features=768, bias=True)
            (dropout): Dropout(p=0.1, inplace=False)
          )
          (output): BertSelfOutput(
            (dense): Linear(in_features=768, out_features=768, bias=True)
            (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
            (dropout): Dropout(p=0.1, inplace=False)
          )
        )
        (intermediate): BertIntermediate(
          (dense): Linear(in_features=768, out_features=3072, bias=True)
        )
        (output): BertOutput(
          (dense): Linear(in_features=3072, out_features=768, bias=True)
          (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
          (dropout): Dropout(p=0.1, inplace=False)
        )
      )
      (1): BertLayer(
        (attention): BertAttention(
          (self): BertSelfAttention(
            (query): Linear(in_features=768, out_features=768, bias=True)
            (key): Linear(in_features=768, out_features=768, bias=True)
            (value): Linear(in_features=768, out_features=768, bias=True)
            (dropout): Dropout(p=0.1, inplace=False)
          )
          (output): BertSelfOutput(
            (dense): Linear(in_features=768, out_features=768, bias=True)
            (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
            (dropout): Dropout(p=0.1, inplace=False)
          )
        )
        (intermediate): BertIntermediate(
          (dense): Linear(in_features=768, out_features=3072, bias=True)
        )
        (output): BertOutput(
          (dense): Linear(in_features=3072, out_features=768, bias=True)
          (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
          (dropout): Dropout(p=0.1, inplace=False)
        )
      )
      (2): BertLayer(
        (attention): BertAttention(
          (self): BertSelfAttention(
            (query): Linear(in_features=768, out_features=768, bias=True)
            (key): Linear(in_features=768, out_features=768, bias=True)
            (value): Linear(in_features=768, out_features=768, bias=True)
            (dropout): Dropout(p=0.1, inplace=False)
          )
          (output): BertSelfOutput(
            (dense): Linear(in_features=768, out_features=768, bias=True)
            (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
            (dropout): Dropout(p=0.1, inplace=False)
          )
        )
        (intermediate): BertIntermediate(
          (dense): Linear(in_features=768, out_features=3072, bias=True)
        )
        (output): BertOutput(
          (dense): Linear(in_features=3072, out_features=768, bias=True)
          (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
          (dropout): Dropout(p=0.1, inplace=False)
        )
      )
      (3): BertLayer(
        (attention): BertAttention(
          (self): BertSelfAttention(
            (query): Linear(in_features=768, out_features=768, bias=True)
            (key): Linear(in_features=768, out_features=768, bias=True)
            (value): Linear(in_features=768, out_features=768, bias=True)
            (dropout): Dropout(p=0.1, inplace=False)
          )
          (output): BertSelfOutput(
            (dense): Linear(in_features=768, out_features=768, bias=True)
            (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
            (dropout): Dropout(p=0.1, inplace=False)
          )
        )
        (intermediate): BertIntermediate(
          (dense): Linear(in_features=768, out_features=3072, bias=True)
        )
        (output): BertOutput(
          (dense): Linear(in_features=3072, out_features=768, bias=True)
          (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
          (dropout): Dropout(p=0.1, inplace=False)
        )
      )
      (4): BertLayer(
        (attention): BertAttention(
          (self): BertSelfAttention(
            (query): Linear(in_features=768, out_features=768, bias=True)
            (key): Linear(in_features=768, out_features=768, bias=True)
            (value): Linear(in_features=768, out_features=768, bias=True)
            (dropout): Dropout(p=0.1, inplace=False)
          )
          (output): BertSelfOutput(
            (dense): Linear(in_features=768, out_features=768, bias=True)
            (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
            (dropout): Dropout(p=0.1, inplace=False)
          )
        )
        (intermediate): BertIntermediate(
          (dense): Linear(in_features=768, out_features=3072, bias=True)
        )
        (output): BertOutput(
          (dense): Linear(in_features=3072, out_features=768, bias=True)
          (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
          (dropout): Dropout(p=0.1, inplace=False)
        )
      )
      (5): BertLayer(
        (attention): BertAttention(
          (self): BertSelfAttention(
            (query): Linear(in_features=768, out_features=768, bias=True)
            (key): Linear(in_features=768, out_features=768, bias=True)
            (value): Linear(in_features=768, out_features=768, bias=True)
            (dropout): Dropout(p=0.1, inplace=False)
          )
          (output): BertSelfOutput(
            (dense): Linear(in_features=768, out_features=768, bias=True)
            (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
            (dropout): Dropout(p=0.1, inplace=False)
          )
        )
        (intermediate): BertIntermediate(
          (dense): Linear(in_features=768, out_features=3072, bias=True)
        )
        (output): BertOutput(
          (dense): Linear(in_features=3072, out_features=768, bias=True)
          (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
          (dropout): Dropout(p=0.1, inplace=False)
        )
      )
      (6): BertLayer(
        (attention): BertAttention(
          (self): BertSelfAttention(
            (query): Linear(in_features=768, out_features=768, bias=True)
            (key): Linear(in_features=768, out_features=768, bias=True)
            (value): Linear(in_features=768, out_features=768, bias=True)
            (dropout): Dropout(p=0.1, inplace=False)
          )
          (output): BertSelfOutput(
            (dense): Linear(in_features=768, out_features=768, bias=True)
            (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
            (dropout): Dropout(p=0.1, inplace=False)
          )
        )
        (intermediate): BertIntermediate(
          (dense): Linear(in_features=768, out_features=3072, bias=True)
        )
        (output): BertOutput(
          (dense): Linear(in_features=3072, out_features=768, bias=True)
          (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
          (dropout): Dropout(p=0.1, inplace=False)
        )
      )
      (7): BertLayer(
        (attention): BertAttention(
          (self): BertSelfAttention(
            (query): Linear(in_features=768, out_features=768, bias=True)
            (key): Linear(in_features=768, out_features=768, bias=True)
            (value): Linear(in_features=768, out_features=768, bias=True)
            (dropout): Dropout(p=0.1, inplace=False)
          )
          (output): BertSelfOutput(
            (dense): Linear(in_features=768, out_features=768, bias=True)
            (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
            (dropout): Dropout(p=0.1, inplace=False)
          )
        )
        (intermediate): BertIntermediate(
          (dense): Linear(in_features=768, out_features=3072, bias=True)
        )
        (output): BertOutput(
          (dense): Linear(in_features=3072, out_features=768, bias=True)
          (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
          (dropout): Dropout(p=0.1, inplace=False)
        )
      )
      (8): BertLayer(
        (attention): BertAttention(
          (self): BertSelfAttention(
            (query): Linear(in_features=768, out_features=768, bias=True)
            (key): Linear(in_features=768, out_features=768, bias=True)
            (value): Linear(in_features=768, out_features=768, bias=True)
            (dropout): Dropout(p=0.1, inplace=False)
          )
          (output): BertSelfOutput(
            (dense): Linear(in_features=768, out_features=768, bias=True)
            (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
            (dropout): Dropout(p=0.1, inplace=False)
          )
        )
        (intermediate): BertIntermediate(
          (dense): Linear(in_features=768, out_features=3072, bias=True)
        )
        (output): BertOutput(
          (dense): Linear(in_features=3072, out_features=768, bias=True)
          (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
          (dropout): Dropout(p=0.1, inplace=False)
        )
      )
      (9): BertLayer(
        (attention): BertAttention(
          (self): BertSelfAttention(
            (query): Linear(in_features=768, out_features=768, bias=True)
            (key): Linear(in_features=768, out_features=768, bias=True)
            (value): Linear(in_features=768, out_features=768, bias=True)
            (dropout): Dropout(p=0.1, inplace=False)
          )
          (output): BertSelfOutput(
            (dense): Linear(in_features=768, out_features=768, bias=True)
            (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
            (dropout): Dropout(p=0.1, inplace=False)
          )
        )
        (intermediate): BertIntermediate(
          (dense): Linear(in_features=768, out_features=3072, bias=True)
        )
        (output): BertOutput(
          (dense): Linear(in_features=3072, out_features=768, bias=True)
          (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
          (dropout): Dropout(p=0.1, inplace=False)
        )
      )
    )
  )
  (pooler): BertPooler(
    (dense): Linear(in_features=768, out_features=768, bias=True)
    (activation): Tanh()
  )
)

Tokenizer

  • Popular tokenization algorithms are word-based, character-based and sub-word based.
  • Word based tokenization : Splitting based on spaces, punctuation. Each word has a specific ID
    • Limitation - very similar words (e.g. "dog" and "dogs") get entirely different IDs, so the model treats them as unrelated
    • Vocabulary can end up very large
    • Large vocabularies result in heavy models
    • To limit the vocabulary, one option is to keep only the most frequent words
    • Out-of-vocabulary (OOV) words are mapped to an unknown token, which results in loss of information
  • Character based tokenization
    • Splitting raw text into characters
    • Vocabularies are slimmer
    • Intuitively, individual characters hold less information than whole words
    • Sentences are translated into very long sequences of tokens
  • Subword based tokenization
    • Middle ground between word and character-based algorithms
    • Frequently used words should not be split into smaller subwords
    • Rare words should be decomposed into meaningful subwords
    • Identify similar syntactic or semantic situations in text
    • Can identify which tokens start a word and which continue one
    • Most models obtaining state-of-the-art results in English today use some kind of subword tokenization algorithm
      • WordPiece: BERT, DistilBERT
      • Unigram: XLNet, ALBERT
      • Byte-Pair Encoding: GPT-2, RoBERTa
    • Helps reduce vocabulary size and lets related words share information (see the short example below)
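A quick way to see subword tokenization in action is to tokenize a common word and a rarer one and look for WordPiece continuation pieces, which are marked with a "##" prefix. A minimal sketch; the exact split depends on the checkpoint's vocabulary:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
print(tokenizer.tokenize("learning"))        # frequent words usually stay whole
print(tokenizer.tokenize("tokenization"))    # rarer words are decomposed into subwords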

Tokenization

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased')
tokens = tokenizer.tokenize("Let's look at this problem again")
",".join(tokens)
let,',s,look,at,this,problem,again
tokenizer.convert_tokens_to_ids(tokens)
2292 1005 1055 2298 2012 2023 3291 2153

Adding Special Tokens

The prepare_for_model method knows which special tokens the model expects.

input_ids = tokenizer.convert_tokens_to_ids(tokens)
final_inputs = tokenizer.prepare_for_model(input_ids)
final_inputs
input_ids : (101 2292 1005 1055 2298 2012 2023 3291 2153 102) token_type_ids : (0 0 0 0 0 0 0 0 0 0) attention_mask : (1 1 1 1 1 1 1 1 1 1)
tokenizer.decode(input_ids)
let's look at this problem again
  • RoBERTa uses <s> and </s> as its special tokens (instead of [CLS] and [SEP])
  • The tokenizer can be called directly on any input sentence; the output contains input_ids, token_type_ids and attention_mask

Batching Inputs together

  • Sentences we want to group inside a batch will often have different lengths
  • We usually pad the smaller sentences to the length of the longest one
  • Padding uses a special token whose ID is specific to each model
tokenizer.pad_token_id
0
  • The padding is handled behind the scenes by the tokenizer (see the padded example below)
results = tokenizer(["I love emacs", " Emacs is the best IDE ever"])
results['attention_mask']
1 1 1 1 1 1 1
1 1 1 1 1 1 1 1 1 1 1
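In the call above no padding was requested, so the two attention masks have different lengths. Passing padding=True pads the shorter input and marks the padded positions with 0 in its attention mask; a minimal sketch reusing the same tokenizer:

padded = tokenizer(["I love emacs", " Emacs is the best IDE ever"], padding=True)
# The first mask is now padded with zeros up to the length of the longest sequence in the batch
print(padded["attention_mask"])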

Hugging Face Datasets

  • The Datasets library uses Apache Arrow, which memory-maps datasets from disk. Only the parts of a dataset that are in use are loaded into memory, so it is unlikely that you will get an out-of-memory error

Datasets

from datasets import load_dataset

raw_datasets = load_dataset("glue", "mrpc")
raw_datasets["train"]
Dataset({
    features: ['sentence1', 'sentence2', 'label', 'idx'],
    num_rows: 3668
})
import pandas as pd

pd.Series(raw_datasets["train"].features)
sentence1                       Value(dtype='string', id=None)
sentence2                       Value(dtype='string', id=None)
label        ClassLabel(num_classes=2, names=['not_equivale...
idx                              Value(dtype='int32', id=None)
dtype: object
  • The map method allows you to apply a function to every example across all the splits of a dataset
checkpoint = "bert-base-cased"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)

def tokenize_function(example):
    return tokenizer(example['sentence1'], example['sentence2'], padding='max_length', truncation=True, max_length=128)

tokenized_datasets = raw_datasets.map(tokenize_function)
pd.Series(tokenized_datasets.column_names)

train         [sentence1, sentence2, label, idx, input_ids, ...
validation    [sentence1, sentence2, label, idx, input_ids, ...
test          [sentence1, sentence2, label, idx, input_ids, ...
dtype: object
  • You can preprocess faster by passing batched=True; the applied function then receives multiple examples at each call
  • There are many other useful methods in the Datasets library, a few of which are shown below
checkpoint = "bert-base-cased"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)

def tokenize_function(example):
    return tokenizer(example['sentence1'], example['sentence2'], padding='max_length', truncation=True, max_length=128)

tokenized_datasets = raw_datasets.map(tokenize_function)
tokenized_datasets = tokenized_datasets.remove_columns(["idx", "sentence1", "sentence2"])
tokenized_datasets = tokenized_datasets.rename_column("label", 'labels')
tokenized_datasets = tokenized_datasets.with_format("torch")
tokenized_datasets['train']

Dataset({
    features: ['labels', 'input_ids', 'token_type_ids', 'attention_mask'],
    num_rows: 3668
})
  • Text classification can also be applied to a pair of sentences
  • GLUE benchmark - 10 datasets for text classification
    • Datasets with single sentences - CoLA, SST-2
    • Datasets with pairs of sentences - MRPC, STS-B, QQP, MNLI, QNLI, RTE, WNLI
    • 8 of 10 datasets are focused on pairs of sentences
  • Models like BERT are pretrained to recognize relationships between two sentences
  • We need to pad sentences of different lengths to make batches. There are various ways to pad
    • Fixed Padding: Pad all the sentences in the whole dataset to the maximum length in the dataset
    • Dynamic Padding: Pad the sentence at the batch creation, to the length of the longest sentence
      • Dynamic padding may not work equally well on all hardware; accelerators that prefer fixed shapes benefit from fixed padding
      • To apply dynamic padding, the padding arguments are removed from the pre-processing step
      • One can use the DataCollatorWithPadding class to apply dynamic padding batch by batch, as sketched below
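A minimal sketch of dynamic padding with DataCollatorWithPadding, assuming the raw MRPC dataset loaded earlier and tokenization without any padding arguments:

from transformers import AutoTokenizer, DataCollatorWithPadding

tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")

def tokenize_function(example):
    # no padding here; the collator pads each batch on the fly
    return tokenizer(example["sentence1"], example["sentence2"], truncation=True)

tokenized = raw_datasets.map(tokenize_function, batched=True)
data_collator = DataCollatorWithPadding(tokenizer)

# Take a few samples and keep only the tokenizer outputs
samples = [tokenized["train"][i] for i in range(4)]
samples = [{k: s[k] for k in ("input_ids", "token_type_ids", "attention_mask")} for s in samples]

batch = data_collator(samples)
print(batch["input_ids"].shape)  # padded only to the longest sequence in this small batch

Passing this collator as collate_fn to a DataLoader, or as data_collator to the Trainer, applies the same batch-level padding during training.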

Trainer API

  • The Trainer API provides many useful features that let you sidestep writing your own training loop
  • Trainer API can be used to easily train or fine-tune Transformer models
from datasets import load_dataset
from transformers import AutoTokenizer, DataCollatorWithPadding
from transformers import AutoModelForSequenceClassification
from transformers import TrainingArguments
from transformers import Trainer
from datasets import load_metric
import numpy as np
raw_datasets = load_dataset("glue", "mrpc")
checkpoint = "bert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)

def tokenize_function(examples):
    return tokenizer(examples["sentence1"], examples["sentence2"], truncation=True)

tokenized_datasets = raw_datasets.map(tokenize_function, batched=True)

data_collator = DataCollatorWithPadding(tokenizer)

model = AutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=2)

training_args = TrainingArguments("test-trainer")  # default arguments
training_args = TrainingArguments("test-trainer", per_device_train_batch_size=16, per_device_eval_batch_size=16, num_train_epochs=5, learning_rate=2e-5, weight_decay=0.01)
trainer = Trainer(model, training_args, train_dataset=tokenized_datasets['train'], eval_dataset=tokenized_datasets['validation'], data_collator=data_collator, tokenizer=tokenizer)

trainer.train()  # fine-tune the model

predictions = trainer.predict(tokenized_datasets['validation'])
metric = load_metric("glue", "mrpc")
preds = np.argmax(predictions.predictions, axis=-1)
metric.compute(predictions=preds, references=predictions.label_ids)

  • Alternatively, one can prepare the dataset with the Datasets library, wrap it in a PyTorch DataLoader, and then train with a standard PyTorch training loop, as sketched below
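A minimal sketch of that approach, assuming the tokenized_datasets and data_collator from the Trainer example above; this is a plain PyTorch loop, not any particular recommended recipe:

import torch
from torch.utils.data import DataLoader
from transformers import AutoModelForSequenceClassification

# keep only the columns the model expects
train_dataset = tokenized_datasets["train"].remove_columns(["idx", "sentence1", "sentence2"])
train_dataset = train_dataset.rename_column("label", "labels")
train_dataset.set_format("torch")

train_dataloader = DataLoader(train_dataset, batch_size=16, shuffle=True, collate_fn=data_collator)

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2).to(device)
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)

model.train()
for epoch in range(3):
    for batch in train_dataloader:
        batch = {k: v.to(device) for k, v in batch.items()}
        outputs = model(**batch)       # the model returns the loss when labels are supplied
        outputs.loss.backward()
        optimizer.step()
        optimizer.zero_grad()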

HuggingFace Accelerate

  • There are multiple hardware setups on which one can run training (single GPU, multiple GPUs, TPUs, distributed machines)
  • The Trainer API can handle all those setups
  • Accelerate has been designed to let you retain control of your own training loop while running it on any of the infrastructure used to train the network (see the sketch below)
  • Accelerate also handles distributed evaluation
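A minimal sketch of how a plain PyTorch loop changes with Accelerate, assuming the model, optimizer and train_dataloader from the previous sketch:

from accelerate import Accelerator

accelerator = Accelerator()

# Accelerate places everything on the right device(s); no manual .to(device) calls needed
model, optimizer, train_dataloader = accelerator.prepare(model, optimizer, train_dataloader)

model.train()
for batch in train_dataloader:
    outputs = model(**batch)
    accelerator.backward(outputs.loss)   # replaces loss.backward()
    optimizer.step()
    optimizer.zero_grad()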

Takeaway

It is awesome to see that Hugging Face has created a unified interface for many pre-trained models. Many of the common tasks in NLP modeling have been encapsulated via the pipeline functionality. This library truly democratizes the usage of pre-trained models for many practical, real-life use cases.