Hugging Face Course Notes: Chapter 3

Hugging-Face
NLP
LLMs
Pytorch
Datasets
Swag
Author

Ismail TG

Published

October 19, 2023

Introduction:

  • In the previous chapter we learned how to use tokenizers and pretrained models to make predictions.
  • In this chapter we will see how to fine-tune a model on our own dataset by learning:
    • How to prepare a large dataset for the fine-tuning process
    • How to use the high-level Trainer API to fine-tune a model
    • How to write a custom training loop and leverage the Hugging Face Accelerate library to easily run it on any distributed setup
  • But first, let's do the usual: pick an architecture/model/tokenizer, and then train it on some sample data:
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification, AdamW
mdl_ckp = 'bert-base-uncased'
tokenizer = AutoTokenizer.from_pretrained(mdl_ckp)
model = AutoModelForSequenceClassification.from_pretrained(mdl_ckp)
sequences = ["I've been waiting for a HuggingFace course my whole life.", "This course is amazing!"]
batch = tokenizer(sequences, truncation=True, padding=True, return_tensors='pt')
Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.weight', 'classifier.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
# the labels will be explained later
batch['labels'] = torch.tensor([1, 1])
# training
optimizer = torch.optim.AdamW(model.parameters())
loss = model(**batch).loss
loss.backward()
optimizer.step()
  • Of course, training a model on 2 sentences will not yield good results
  • So we need to train it on a larger dataset
  • In this chapter we will work with the MRPC (Microsoft Research Paraphrase Corpus) dataset as an example.
    • The dataset consists of 5,801 pairs of sentences, with a label indicating if they are paraphrases or not (i.e., if both sentences mean the same thing)
    • This is one of the 10 datasets composing the GLUE benchmark, which is an academic benchmark that is used to measure the performance of ML models across 10 different text classification tasks
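  • As a quick optional check (assuming only that the Datasets library is installed), we can list the GLUE subsets and confirm that mrpc is one of them:
from datasets import get_dataset_config_names
# list the configurations (subsets) of the GLUE benchmark; 'mrpc' should be among them
get_dataset_config_names('glue')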

Loading Datasets From The Hub:

  • We can easily download a dataset from the Hub just like we did with models before:
# load dataset
from datasets import load_dataset
raw_ds = load_dataset('glue', 'mrpc')
raw_ds
DatasetDict({
    train: Dataset({
        features: ['sentence1', 'sentence2', 'label', 'idx'],
        num_rows: 3668
    })
    validation: Dataset({
        features: ['sentence1', 'sentence2', 'label', 'idx'],
        num_rows: 408
    })
    test: Dataset({
        features: ['sentence1', 'sentence2', 'label', 'idx'],
        num_rows: 1725
    })
})
  • The dataset is returned as a DatasetDict, a dictionary-like object that organizes our dataset by split.
    • Here we have our training, validation and test sets.
    • Each split has two keys: features and num_rows.
    • features contains: sentence1, sentence2, label, idx
    • sentence1 and sentence2 represent the pair our model must classify as paraphrased or not.
  • We can access each pair of sentences in our raw_datasets object by indexing, like with a dictionary:
# training set
train_ds = raw_ds['train']
train_ds[22]
{'sentence1': 'A BMI of 25 or above is considered overweight ; 30 or above is considered obese .',
 'sentence2': 'A BMI between 18.5 and 24.9 is considered normal , over 25 is considered overweight and 30 or greater is defined as obese .',
 'label': 0,
 'idx': 24}
  • Here we see the pair of sentences, the label and the index of that pair.
  • Labels are already integers, so we won't need to preprocess them.
  • What does label: 0 mean?
# what each label means
train_ds.features
{'sentence1': Value(dtype='string', id=None),
 'sentence2': Value(dtype='string', id=None),
 'label': ClassLabel(names=['not_equivalent', 'equivalent'], id=None),
 'idx': Value(dtype='int32', id=None)}
  • 0 for not_equivalent and 1 for equivalent
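  • As an optional check, we can map the label ids back to their names through the ClassLabel feature:
# map each label id back to its name via the ClassLabel feature
label_feature = train_ds.features['label']
label_feature.int2str(0), label_feature.int2str(1)
# -> ('not_equivalent', 'equivalent')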
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained(mdl_ckp)
# naive approach: tokenizing each sentence column separately loses the pair relationship
train_seq1 = tokenizer(train_ds['sentence1'])
train_seq2 = tokenizer(train_ds['sentence2'])

Preprocessing the dataset:

  • We can't just pass the two sequences to the model separately and expect a proper prediction about whether they are paraphrases.
  • We need to prepare the data properly in order to feed the model pairs of sequences instead of two sentences separately.
  • This can be done with the tokenizer: we tokenize the two sentences as a pair, encoded the way BERT expects:
#example
input = tokenizer('this is the first sentence', 'this is number 2')
input
{'input_ids': [101, 2023, 2003, 1996, 2034, 6251, 102, 2023, 2003, 2193, 1016, 102], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}
  • The tokenizer outputs input_ids and attention_mask, but also token_type_ids.
  • This feature shows that the tokenizer is aware we are dealing with two sentences: tokens from the first are marked with 0 and tokens from the second with 1
input.token_type_ids
[0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1]
  • If we convert the input_ids back to tokens we can get an idea of what happened:
tokenizer.convert_ids_to_tokens(input['input_ids'])
['[CLS]',
 'this',
 'is',
 'the',
 'first',
 'sentence',
 '[SEP]',
 'this',
 'is',
 'number',
 '2',
 '[SEP]']
  • So we see the model expects the inputs to be of the form [CLS] sentence1 [SEP] sentence2 [SEP]
  • Note that not every model's tokenizer behaves this way; it depends on how the model was pretrained. BERT has seen sentence pairs during pretraining and knows how to handle them.
  • We can then pass pairs of sentences to the tokenizer like this:
tokenized_dataset = tokenizer(train_ds['sentence1'], train_ds['sentence2'], truncation=True, padding=True)
  • This way of tokenizing the whole dataset at once is not ideal, since it requires enough RAM to store the entire tokenized dataset in memory while we process it.
  • It also returns a plain dictionary with the keys input_ids, attention_mask and token_type_ids and their values, instead of a Dataset.
  • To work around this we will use the map() method, which keeps the data as a Dataset and also gives us more flexibility if we need preprocessing beyond tokenization.
  • map() works by applying a function to each element of the dataset, so let's create a function that tokenizes pairs of sentences for map() to apply over the whole dataset:
def func_tokenize(example):
  return tokenizer(example['sentence1'], example['sentence2'], truncation=True)
  • This function takes a dictionary (like the items of our dataset) and returns a new dictionary with the keys input_ids, attention_mask, and token_type_ids.
  • We didn't include padding here, because it's not efficient to pad the whole dataset to the longest sentence overall when we can do it at the batch level
  • We can pass batched=True as an argument to the map() method:
tokenized_datasets = raw_ds.map(func_tokenize, batched=True)
  • Let's take a look at an example from the tokenized training set:

  • We get what we expected: the three keys produced by the tokenization process, plus the keys we already had: label, idx, sentence1 and sentence2:

tokenized_datasets['train'][55].keys()
dict_keys(['sentence1', 'sentence2', 'label', 'idx', 'input_ids', 'token_type_ids', 'attention_mask'])
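  • As an optional sanity check, we can decode that example's input_ids to confirm the [CLS] sentence1 [SEP] sentence2 [SEP] structure survived map():
# decode one preprocessed example to see the special tokens in place
tokenizer.decode(tokenized_datasets['train'][55]['input_ids'])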
  • Now we have to deal with the padding, since we decided to apply it at the batch level: each batch will be padded to its own longest sequence.
  • This process is called Dynamic Padding.

Dynamic Padding:

  • Putting the samples together in a single batch is done through a function called the collate function.
  • The collate function converts our samples to PyTorch tensors and concatenates them.
  • But this can't be done without padding, otherwise the tensors would have different shapes.
  • As we said before, padding should be done at the batch level, meaning each batch has its samples padded to the length of its longest sequence; padding the whole dataset instead would give many samples a lot of unnecessary padding.
  • In practice we define a collate function that applies the correct amount of padding to the items we want to batch together:
from transformers import DataCollatorWithPadding
data_collator = DataCollatorWithPadding(tokenizer=tokenizer)
  • Here we test this collate function on some samples from the training set.
  • First we remove the columns idx, sentence1 and sentence2 since we don't need them.
  • Let’s have a look at the length of each entry in the batch:
samples = tokenized_datasets['train'][:8]
samples = {k: v for k, v in samples.items() if k not in ['idx', 'sentence1', 'sentence2']}
[len(x) for x in samples['input_ids']]
[50, 59, 47, 67, 59, 50, 62, 32]
samples.keys()
dict_keys(['label', 'input_ids', 'token_type_ids', 'attention_mask'])
  • These lengths vary between 32 and 67, so the collator should pad every sequence in this particular batch to 67, the length of its longest sample.
sample_batch = data_collator(samples)
{k:v.shape for k, v in sample_batch.items()}
You're using a BertTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.
{'input_ids': torch.Size([8, 67]),
 'token_type_ids': torch.Size([8, 67]),
 'attention_mask': torch.Size([8, 67]),
 'labels': torch.Size([8])}
  • Let’s check again if our input_ids have the same length:
[len(i) for i in sample_batch['input_ids']]
[67, 67, 67, 67, 67, 67, 67, 67]
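  • As an optional check (reusing tokenized_datasets and data_collator from above), collating a different slice of the training set pads it to the length of its own longest sample, confirming that the padding length is decided per batch rather than globally:
# collate another slice: padding length is determined by this batch's longest sample
other = tokenized_datasets['train'][8:16]
other = {k: v for k, v in other.items() if k not in ['idx', 'sentence1', 'sentence2']}
other_batch = data_collator(other)
{k: v.shape for k, v in other_batch.items()}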

Fine-tuning a model with the Trainer API:

  • We can use the Trainer class to fine-tune any pretrained model on our dataset.
  • Now we need to prepare the environment for Trainer.train(), which will run on a GPU.
  • But first we have to define TrainingArguments, which contains all the hyperparameters the Trainer will use for training and evaluation.
  • We just need to provide a directory where the model and its checkpoints will be saved; everything else is left at its defaults, which work fine for learning purposes.
output_dir = 'my_folder'
from transformers import TrainingArguments

training_args = TrainingArguments(output_dir=output_dir)
  • For this dataset we will use the AutoModelForSequenceClassification class with 2 labels:
from transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained(mdl_ckp, num_labels=2)
Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.weight', 'classifier.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
  • The pretrained BERT body we instantiated will be the backbone of our model, but its pretraining head is discarded and replaced with a sequence classification head that fits our task.
  • The weights of this new head are initialized randomly, which means we need to train them from scratch, and that's exactly what we will do:
from transformers import Trainer

trainer = Trainer(
    model,
    training_args,
    train_dataset=tokenized_datasets["train"],
    eval_dataset=tokenized_datasets["validation"],
    data_collator=data_collator,
    tokenizer=tokenizer
)
  • When the tokenizer is passed as an argument like this, we usually don't need to define data_collator ourselves: the Trainer defaults to a DataCollatorWithPadding built from that tokenizer anyway (see the sketch below).
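  • A minimal sketch of that shorter form, assuming the same model, training_args and tokenized datasets as above (equivalent to the Trainer defined earlier):
# no data_collator passed: the Trainer falls back to DataCollatorWithPadding(tokenizer)
trainer = Trainer(
    model,
    training_args,
    train_dataset=tokenized_datasets["train"],
    eval_dataset=tokenized_datasets["validation"],
    tokenizer=tokenizer
)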
  • Now we fine-tune the model on our dataset:
trainer.train()
[1377/1377 03:21, Epoch 3/3]
Step Training Loss
500 0.518800
1000 0.272700

TrainOutput(global_step=1377, training_loss=0.3197195403543, metrics={'train_runtime': 204.8739, 'train_samples_per_second': 53.711, 'train_steps_per_second': 6.721, 'total_flos': 405626802939840.0, 'train_loss': 0.3197195403543, 'epoch': 3.0})
  • One thing we didn't include in our Trainer is an evaluation strategy.
  • We have no idea how good or bad our model is because:
    • We didn’t tell the Trainer to evaluate during training by setting evaluation_strategy to either “steps” (evaluate every eval_steps) or “epoch” (evaluate at the end of each epoch).
    • We didn’t provide the Trainer with a compute_metrics() function to calculate a metric during said evaluation (otherwise the evaluation would just have printed the loss, which is not a very intuitive number).

Evaluation:

  • First we need to build a compute_metrics() function in order to use it in the next training.
  • The function takes an EvalPrediction object as argument, which is basically a named tuple with:
    • predictions field
    • label_ids field
  • Here we get some predictions from our model:
predictions = trainer.predict(tokenized_datasets["validation"])
print(predictions.predictions.shape, predictions.label_ids.shape)
(408, 2) (408,)
  • The output of the predict() method is another named tuple with three fields: predictions, label_ids, and metrics.
  • metrics contains the loss on the dataset, as well as timing metrics (how long prediction took in total and on average).
  • As we see here, predictions is a two-dimensional array with shape 408 x 2 (408 being the number of elements in the dataset we used).
  • It represents the logits for each element of the dataset we passed to predict().
  • To transform them into predictions that we can compare to our labels, we need to take the index with the maximum value on the second axis:
import numpy as np

preds = np.argmax(predictions.predictions, axis=-1)
  • Now we have to build our compute_metrics() function in order to compare the predictions of the model with the actual labels.
  • We will use the metrics from the Hugging Face evaluate library; we just need to load the metrics associated with the dataset we used, mrpc:
import evaluate
metric = evaluate.load('glue', 'mrpc')
metric.compute(predictions= preds, references= predictions.label_ids)
{'accuracy': 0.8529411764705882, 'f1': 0.8993288590604027}
  • Now let's wrap everything in a single function:
def compute_metrics(eval_preds):
  metric = evaluate.load('glue', 'mrpc')
  logits, labels = eval_preds
  predictions = np.argmax(logits, axis= -1)
  return metric.compute(predictions= predictions, references= labels)
  • Now let's use it in a new Trainer so it reports all the metrics at the end of each epoch:
training_args = TrainingArguments('test-train', evaluation_strategy= 'epoch')
trainer= Trainer(
    model,
    training_args,
    train_dataset= tokenized_datasets['train'],
    eval_dataset= tokenized_datasets['validation'],
    data_collator= data_collator,
    tokenizer= tokenizer,
    compute_metrics= compute_metrics
)
trainer.train()
[1377/1377 03:39, Epoch 3/3]
Epoch Training Loss Validation Loss Accuracy F1
1 No log 0.681234 0.843137 0.893333
2 0.162700 0.998050 0.840686 0.892562
3 0.051500 1.032399 0.852941 0.898305

TrainOutput(global_step=1377, training_loss=0.08403856129254997, metrics={'train_runtime': 219.6964, 'train_samples_per_second': 50.087, 'train_steps_per_second': 6.268, 'total_flos': 405626802939840.0, 'train_loss': 0.08403856129254997, 'epoch': 3.0})
  • In this section we fine-tuned a model on a dataset using the Trainer API, which minimizes the work we have to do since it works out of the box.
  • The Trainer can be used for most NLP tasks, but what if we need to do everything manually in pure PyTorch?
  • In the next section we will build the same training loop by hand.

A Full Training:

  • In this section we will try to achieve the same results we got with the Trainer API.
  • Since we are already done with the preprocessing of the dataset, we just have to make a few tweaks: remove the columns we won't use, rename label to labels (the name the model expects), and set the format to PyTorch tensors:
tokenized_ds = tokenized_datasets.remove_columns(['sentence1', 'sentence2', 'idx'])
tokenized_ds = tokenized_ds.rename_column('label', 'labels')
tokenized_ds.set_format('torch')
tokenized_ds['train'].column_names
['labels', 'input_ids', 'token_type_ids', 'attention_mask']
  • Now we need to define the DataLoaders that will feed the dataset to the model in batches:
import torch
from torch.utils.data import DataLoader
train_dataloader = DataLoader(tokenized_ds['train'], shuffle= True, batch_size= 8, collate_fn= data_collator)
eval_dataloader = DataLoader(tokenized_ds['validation'], batch_size= 8, collate_fn= data_collator)
  • Check the training data loader:
for batch in train_dataloader:
  break
{k:v.shape for k,v in batch.items()}
{'labels': torch.Size([8]),
 'input_ids': torch.Size([8, 70]),
 'token_type_ids': torch.Size([8, 70]),
 'attention_mask': torch.Size([8, 70])}
  • To make sure everything will go smoothly during training, we pass a batch to the model:
model = AutoModelForSequenceClassification.from_pretrained(mdl_ckp, num_labels= 2)
output= model(**batch)
output.loss, output.logits.shape
Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.weight', 'classifier.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
(tensor(0.5792, grad_fn=<NllLossBackward0>), torch.Size([8, 2]))
  • Everything looks fine. Before we build the training loop we just need two things:
    • optimizer
    • learning-rate scheduler
  • Since we want to replicate the Trainer and its default parameters, we use the same optimizer it does: AdamW, which is Adam with a fix for weight decay regularization:
optimizer = AdamW(model.parameters(), lr= 5e-5)
/usr/local/lib/python3.10/dist-packages/transformers/optimization.py:411: FutureWarning: This implementation of AdamW is deprecated and will be removed in a future version. Use the PyTorch implementation torch.optim.AdamW instead, or set `no_deprecation_warning=True` to disable this warning
  warnings.warn(
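  • As the warning suggests, a drop-in alternative (assuming the default hyperparameters are fine for us) is PyTorch's own implementation:
# PyTorch's built-in AdamW, recommended by the deprecation warning above
optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5)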
  • Finally, the learning-rate scheduler used by the Trainer is a simple linear decay from the maximum learning rate (5e-5) to 0.
  • In order to define it we just need to know the number of training steps.
from transformers import get_scheduler

num_epochs = 3
num_training_steps = num_epochs * len(train_dataloader)
lr_scheduler= get_scheduler(
    'linear',
    optimizer= optimizer,
    num_warmup_steps = 0,
    num_training_steps = num_training_steps
)
  • As we mentioned before, the Trainer works out of the box with whatever hardware we have, but now we have to set the device (a GPU if available) ourselves:
device = torch.device('cuda') if torch.cuda.is_available() else torch.device('cpu')
model.to(device)
BertForSequenceClassification(
  (bert): BertModel(
    (embeddings): BertEmbeddings(
      (word_embeddings): Embedding(30522, 768, padding_idx=0)
      (position_embeddings): Embedding(512, 768)
      (token_type_embeddings): Embedding(2, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (encoder): BertEncoder(
      (layer): ModuleList(
        (0-11): 12 x BertLayer(
          (attention): BertAttention(
            (self): BertSelfAttention(
              (query): Linear(in_features=768, out_features=768, bias=True)
              (key): Linear(in_features=768, out_features=768, bias=True)
              (value): Linear(in_features=768, out_features=768, bias=True)
              (dropout): Dropout(p=0.1, inplace=False)
            )
            (output): BertSelfOutput(
              (dense): Linear(in_features=768, out_features=768, bias=True)
              (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
              (dropout): Dropout(p=0.1, inplace=False)
            )
          )
          (intermediate): BertIntermediate(
            (dense): Linear(in_features=768, out_features=3072, bias=True)
            (intermediate_act_fn): GELUActivation()
          )
          (output): BertOutput(
            (dense): Linear(in_features=3072, out_features=768, bias=True)
            (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
            (dropout): Dropout(p=0.1, inplace=False)
          )
        )
      )
    )
    (pooler): BertPooler(
      (dense): Linear(in_features=768, out_features=768, bias=True)
      (activation): Tanh()
    )
  )
  (dropout): Dropout(p=0.1, inplace=False)
  (classifier): Linear(in_features=768, out_features=2, bias=True)
)

Training Loop:

  • Now we have all the ingredients to write our training loop:
from tqdm.auto import tqdm
progress_bar = tqdm(range(num_training_steps))

model.train()
for epoch in range(num_epochs):
  for batch in train_dataloader:
    # move the batch tensors to the same device as the model
    batch = {k: v.to(device) for k, v in batch.items()}
    outputs = model(**batch)
    loss = outputs.loss
    loss.backward()

    # update the weights, advance the learning-rate schedule, reset the gradients
    optimizer.step()
    lr_scheduler.step()
    optimizer.zero_grad()
    progress_bar.update(1)

Evaluation Loop:

  • The evaluation loop does essentially the same job as the compute_metrics() function we built before, accumulating predictions batch by batch:
metric = evaluate.load('glue', 'mrpc')
model.eval()
for batch in eval_dataloader:
  batch = {k: v.to(device) for k, v in batch.items()}
  with torch.no_grad():
    outputs = model(**batch)

  logits = outputs.logits
  predictions = torch.argmax(logits, dim=-1)
  metric.add_batch(predictions=predictions, references=batch['labels'])
metric.compute()

To recap, in this chapter we:

- Learned about datasets on the Hub
- Learned how to load and preprocess datasets, including using dynamic padding and collators
- Implemented our own fine-tuning and evaluation of a model
- Implemented a lower-level training loop