import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification, AdamW
Introduction:
- In the previous chapter we learned how to use tokenizers and pretrained models to make predictions.
- In this chapter we will see how to fine-tune a model on our own dataset by learning:
  - How to prepare a large dataset for the fine-tuning process
  - How to use the high-level `Trainer` API to fine-tune a model
  - How to write a full custom training loop in pure PyTorch
  - How to leverage the HuggingFace Accelerate library to easily run that custom training loop on any distributed setup
- But first let's do the usual by picking an architecture/model/tokenizer, and then training it on some sample data:
mdl_ckp = 'bert-base-uncased'
tokenizer = AutoTokenizer.from_pretrained(mdl_ckp)
model = AutoModelForSequenceClassification.from_pretrained(mdl_ckp)
sequences = ["I've been waiting for a HuggingFace course my whole life.", "This course is amazing!"]
batch = tokenizer(sequences, truncation=True, padding=True, return_tensors='pt')
Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.weight', 'classifier.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
# will be explained later
batch['labels'] = torch.tensor([1, 1])
# training
optimizer = torch.optim.AdamW(model.parameters())
loss = model(**batch).loss
loss.backward()
optimizer.step()
- Of course, training a model on 2 sentences will not yield good results.
- So we need to introduce it to a larger dataset.
- In this chapter we will work with the MRPC (Microsoft Research Paraphrase Corpus) dataset as an example.
- The dataset consists of 5,801 pairs of sentences, with a label indicating whether they are paraphrases or not (i.e., whether both sentences mean the same thing).
- This is one of the 10 datasets composing the GLUE benchmark, an academic benchmark used to measure the performance of ML models across 10 different text classification tasks.
Loading Datasets From The Hub:
- We can easily download a dataset from the Hub just like we did with models before:
# load dataset
from datasets import load_dataset
raw_ds = load_dataset('glue', 'mrpc')
raw_ds
DatasetDict({
train: Dataset({
features: ['sentence1', 'sentence2', 'label', 'idx'],
num_rows: 3668
})
validation: Dataset({
features: ['sentence1', 'sentence2', 'label', 'idx'],
num_rows: 408
})
test: Dataset({
features: ['sentence1', 'sentence2', 'label', 'idx'],
num_rows: 1725
})
})
- The result is a `DatasetDict`, a dictionary-like object that organizes our dataset by split (a small sketch below shows how to iterate over it).
- Here we have our training, validation and test sets.
- Each split shows two things: its `features` and its `num_rows`.
- The features are:
  - `sentence1`
  - `sentence2`
  - `label`
  - `idx`
- `sentence1` and `sentence2` represent the pair we train the model on, predicting whether one is a paraphrase of the other.
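Since a `DatasetDict` behaves like a standard Python dictionary, we can iterate over the splits directly; a minimal sketch (the split and column names match the output above):

```python
# DatasetDict is dict-like: iterate over splits and inspect their sizes/columns
for split, ds in raw_ds.items():
    print(split, ds.num_rows, ds.column_names)
# train 3668 ['sentence1', 'sentence2', 'label', 'idx']
# validation 408 ['sentence1', 'sentence2', 'label', 'idx']
# test 1725 ['sentence1', 'sentence2', 'label', 'idx']
```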
- We can access each pair of sentences in our `raw_ds` object by indexing, like with a dictionary:
# training set
train_ds = raw_ds['train']
train_ds[22]
{'sentence1': 'A BMI of 25 or above is considered overweight ; 30 or above is considered obese .',
'sentence2': 'A BMI between 18.5 and 24.9 is considered normal , over 25 is considered overweight and 30 or greater is defined as obese .',
'label': 0,
'idx': 24}
- Here we see the pair of sentences, the label and the index of that pair.
- Labels are already `int` values, so we won't need to preprocess them.
- What does `label: 0` mean?
# what each label means
train_ds.features
{'sentence1': Value(dtype='string', id=None),
'sentence2': Value(dtype='string', id=None),
'label': ClassLabel(names=['not_equivalent', 'equivalent'], id=None),
'idx': Value(dtype='int32', id=None)}
- `0` stands for `not_equivalent` and `1` for `equivalent`; the sketch below shows how to convert between label ids and names programmatically.
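If we ever need to map a label id back to its name, the `ClassLabel` feature shown above can do the conversion; a small illustrative sketch (the variable name `label_feature` is just chosen here):

```python
# ClassLabel can translate between integer ids and their string names
label_feature = train_ds.features['label']
print(label_feature.int2str(0))             # 'not_equivalent'
print(label_feature.int2str(1))             # 'equivalent'
print(label_feature.str2int('equivalent'))  # 1
```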
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained(mdl_ckp)
train_seq1 = tokenizer(train_ds['sentence1'])
train_seq2 = tokenizer(train_ds['sentence2'])
Preprocessing the dataset:
- We can't just tokenize the two sequences separately, pass them to the model, and expect a proper prediction about whether they are paraphrases.
- We need to prepare the data so that the model is fed pairs of sequences instead of two sentences processed separately.
- The tokenizer can do this for us: it builds the pair of token sequences and formats them the way BERT expects:
# example
input = tokenizer('this is the first sentence', 'this is number 2')
input
{'input_ids': [101, 2023, 2003, 1996, 2034, 6251, 102, 2023, 2003, 2193, 1016, 102], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}
- The tokenizer outputs `input_ids` and `attention_mask`, but also `token_type_ids`.
- This feature tells the model which of the two sentences each token belongs to: tokens from the first sentence are marked with `0` and tokens from the second with `1`.
input.token_type_ids
[0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1]
- If we convert the `input_ids` back to tokens, we can get an idea of what happened:
tokenizer.convert_ids_to_tokens(input['input_ids'])
['[CLS]',
'this',
'is',
'the',
'first',
'sentence',
'[SEP]',
'this',
'is',
'number',
'2',
'[SEP]']
- So we see the model expects the inputs to be of the form `[CLS] sentence1 [SEP] sentence2 [SEP]`.
- Note that not every model's tokenizer can do this; it depends on how each model was pretrained. BERT was pretrained on sentence pairs, so it knows how to deal with them.
- We can then pass pairs of sentences to the tokenizer like this:
tokenized_dataset = tokenizer(train_ds['sentence1'], train_ds['sentence2'], truncation=True, padding=True)
- This way of tokenizing the whole dataset at once is not ideal, since it requires enough RAM to hold the entire tokenized dataset in memory while we process it.
- It also returns a plain dictionary (with the keys `input_ids`, `attention_mask`, `token_type_ids` and their values) instead of a dataset.
- To work around this we will use the `map()` method, which keeps the data as a `Dataset` and also gives us more flexibility if we need preprocessing beyond tokenization.
- `map()` works by applying a function to each element of the dataset, so let's create a function that tokenizes pairs of sentences for `map()` to apply over the whole dataset:
def func_tokenize(example):
    return tokenizer(example['sentence1'], example['sentence2'], truncation=True)
- This function takes a dictionary (like the items of our dataset) and returns a new dictionary with the keys `input_ids`, `attention_mask`, and `token_type_ids`.
- We didn't include `padding` here, because it's not efficient to pad the whole dataset to its single longest sentence when we can pad at the batch level instead.
- We can pass batching as an argument to the `map()` method:
tokenized_datasets = raw_ds.map(func_tokenize, batched=True)
Let's take a look at an example pair from the training dataset. We get what we expected: the 3 keys produced by the tokenization, plus the dictionary keys we already had (`label`, `idx`, `sentence1` and `sentence2`):
tokenized_datasets['train'][55].keys()
dict_keys(['sentence1', 'sentence2', 'label', 'idx', 'input_ids', 'token_type_ids', 'attention_mask'])
- Now we have to deal with the padding. Since we decided to apply it at the batch level, each batch will be padded to its own longest sequence.
- This process is called dynamic padding.
Dynamic Padding:
- Putting the samples together into a single batch is done through a function called the collate function.
- The collate function converts our samples to PyTorch tensors and concatenates them.
- But this can't be done without padding, otherwise the tensors would have different shapes.
- As we said before, padding should happen at the batch level: each batch is padded to its own longest sequence, otherwise some samples would end up with a lot of padding.
- In practice we need a collate function that applies the correct amount of padding to the items of the dataset we want to batch together; the transformers library provides one (a minimal hand-written sketch follows the code below):
from transformers import DataCollatorWithPadding
data_collator = DataCollatorWithPadding(tokenizer=tokenizer)
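For intuition, here is a rough sketch of what such a collator does under the hood, relying only on the tokenizer's `pad()` method. The helper name `pad_collate` is hypothetical, and the real `DataCollatorWithPadding` additionally renames the `label` column to `labels`:

```python
# Hypothetical hand-written collate function: pad each batch to its own longest sequence
def pad_collate(features):
    # features: a batch of tokenized samples ('input_ids', 'token_type_ids', 'attention_mask', ...)
    return tokenizer.pad(features, padding=True, return_tensors='pt')
```

Passing `collate_fn=pad_collate` to a `DataLoader` would give batches padded the same way as with the library collator (minus the label renaming).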
- Here we test this collate function on some samples from the training set.
- We first need to remove the columns `idx`, `sentence1` and `sentence2`, since the collator doesn't need them.
- Let's have a look at the length of each entry in the batch:
samples = tokenized_datasets['train'][:8]
samples = {k: v for k, v in samples.items() if k not in ['idx', 'sentence1', 'sentence2']}
[len(x) for x in samples['input_ids']]
[50, 59, 47, 67, 59, 50, 62, 32]
samples.keys()
dict_keys(['label', 'input_ids', 'token_type_ids', 'attention_mask'])
- These samples vary in length between `32` and `67` tokens, so the collator's job is to pad every sequence in this particular batch up to 67, the length of its longest sequence.
sample_batch = data_collator(samples)
{k: v.shape for k, v in sample_batch.items()}
You're using a BertTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.
{'input_ids': torch.Size([8, 67]),
'token_type_ids': torch.Size([8, 67]),
'attention_mask': torch.Size([8, 67]),
'labels': torch.Size([8])}
- Let's check again that our `input_ids` all have the same length:
[len(i) for i in sample_batch['input_ids']]
[67, 67, 67, 67, 67, 67, 67, 67]
Fine-tuning a model with the Trainer API:
- We can use the `Trainer` class to fine-tune any pretrained model on our dataset.
- Now we need to prepare the environment for `Trainer.train()`, which will run on a GPU.
- But first we have to define the `TrainingArguments`, which contain all the hyperparameters the `Trainer` will use for training and evaluation.
- We just need to provide a directory where the model and its checkpoints will be saved; everything else keeps its default value, which is fine for learning purposes.
output_dir = 'my_folder'
from transformers import TrainingArguments
training_args = TrainingArguments(output_dir=output_dir)
- For this dataset we will use the `AutoModelForSequenceClassification` class with 2 labels:
from transformers import AutoModelForSequenceClassification
model = AutoModelForSequenceClassification.from_pretrained(mdl_ckp, num_labels=2)
Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.weight', 'classifier.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
- The pretrained BERT body we instantiated is the backbone of our model; its original pretraining head is dropped and replaced by a sequence classification head that fits our task.
- The weights of this new head are initialized randomly (hence the warning above), which means we need to train them from scratch, which is exactly what we will do:
from transformers import Trainer
trainer = Trainer(
    model,
    training_args,
    train_dataset=tokenized_datasets["train"],
    eval_dataset=tokenized_datasets["validation"],
    data_collator=data_collator,
    tokenizer=tokenizer
)
- When the tokenizer is passed as an argument like this, we usually don't need to define `data_collator` ourselves: the default collator is already a `DataCollatorWithPadding` built from that tokenizer, so the `Trainer` would create it for us anyway.
- Now we fine-tune the model on our dataset:
trainer.train()
[1377/1377 03:21, Epoch 3/3]
| Step | Training Loss |
|---|---|
| 500 | 0.518800 |
| 1000 | 0.272700 |
TrainOutput(global_step=1377, training_loss=0.3197195403543, metrics={'train_runtime': 204.8739, 'train_samples_per_second': 53.711, 'train_steps_per_second': 6.721, 'total_flos': 405626802939840.0, 'train_loss': 0.3197195403543, 'epoch': 3.0})
- One thing we didn't include in our `Trainer` is an evaluation strategy.
- We have no idea how good or bad our model is, because:
  - We didn't tell the `Trainer` to evaluate during training by setting `evaluation_strategy` to either "steps" (evaluate every `eval_steps`) or "epoch" (evaluate at the end of each epoch).
  - We didn't provide the `Trainer` with a `compute_metrics()` function to calculate a metric during that evaluation (otherwise the evaluation would just have printed the loss, which is not a very intuitive number).
Evaluation:
- First we need to build a `compute_metrics()` function so we can use it in the next training run.
- The function takes an `EvalPrediction` object as its argument, which is basically a named tuple with:
  - a `predictions` field
  - a `label_ids` field
- Here we get some predictions from our model:
predictions = trainer.predict(tokenized_datasets["validation"])
print(predictions.predictions.shape, predictions.label_ids.shape)
(408, 2) (408,)
- The output of the `predict()` method is another named tuple with three fields: `predictions`, `label_ids`, and `metrics`.
- The `metrics` field contains the loss on the dataset, as well as some time metrics (how long prediction took in total and on average).
- As we see here, `predictions` is a two-dimensional array of shape 408 x 2 (408 being the number of elements in the dataset we used).
- These are the logits for each element of the dataset we passed to `predict()`.
- To transform them into predictions we can compare to our labels, we take the index with the maximum value on the second axis:
import numpy as np
preds = np.argmax(predictions.predictions, axis=-1)
- Now we have to build our `compute_metrics()` function to compare the model's predictions with the actual labels.
- We will use the metrics from the HuggingFace `evaluate` library; we just need to load the metric associated with the dataset we used, `mrpc`:
import evaluate
metric = evaluate.load('glue', 'mrpc')
metric.compute(predictions=preds, references=predictions.label_ids)
{'accuracy': 0.8529411764705882, 'f1': 0.8993288590604027}
- Now let's wrap everything in a single function:
def compute_metrics(eval_preds):
    metric = evaluate.load('glue', 'mrpc')
    logits, labels = eval_preds
    predictions = np.argmax(logits, axis=-1)
    return metric.compute(predictions=predictions, references=labels)
- Now let's use it in the training so that all the metrics are reported at the end of each epoch:
training_args = TrainingArguments('test-train', evaluation_strategy='epoch')

trainer = Trainer(
    model,
    training_args,
    train_dataset=tokenized_datasets['train'],
    eval_dataset=tokenized_datasets['validation'],
    data_collator=data_collator,
    tokenizer=tokenizer,
    compute_metrics=compute_metrics
)
trainer.train()
[1377/1377 03:39, Epoch 3/3]
| Epoch | Training Loss | Validation Loss | Accuracy | F1 |
|---|---|---|---|---|
| 1 | No log | 0.681234 | 0.843137 | 0.893333 |
| 2 | 0.162700 | 0.998050 | 0.840686 | 0.892562 |
| 3 | 0.051500 | 1.032399 | 0.852941 | 0.898305 |
TrainOutput(global_step=1377, training_loss=0.08403856129254997, metrics={'train_runtime': 219.6964, 'train_samples_per_second': 50.087, 'train_steps_per_second': 6.268, 'total_flos': 405626802939840.0, 'train_loss': 0.08403856129254997, 'epoch': 3.0})
- In this section we fine-tuned a model on a dataset using the `Trainer` API, which minimizes the work we have to do since it works out of the box.
- The `Trainer` can be used for most NLP tasks, but what if we want to do everything manually in pure PyTorch?
- In the next section we will build the equivalent of the `Trainer` by hand.
A Full Training:
- In this section we will try to achieve the same results we got with the `Trainer` API, but with a training loop written by hand.
- Since we are already done with preprocessing the dataset, we just have to make a few tweaks regarding the columns we won't use:
tokenized_ds = tokenized_datasets.remove_columns(['sentence1', 'sentence2', 'idx'])
tokenized_ds = tokenized_ds.rename_column('label', 'labels')
tokenized_ds.set_format('torch')
tokenized_ds['train'].column_names
['labels', 'input_ids', 'token_type_ids', 'attention_mask']
- Now we need to define the `DataLoader`s that will feed the dataset to the model in batches:
import torch
from torch.utils.data import DataLoader

train_dataloader = DataLoader(tokenized_ds['train'], shuffle=True, batch_size=8, collate_fn=data_collator)
eval_dataloader = DataLoader(tokenized_ds['validation'], batch_size=8, collate_fn=data_collator)
- Check the training data loader:
for batch in train_dataloader:
    break
{k: v.shape for k, v in batch.items()}
{'labels': torch.Size([8]),
'input_ids': torch.Size([8, 70]),
'token_type_ids': torch.Size([8, 70]),
'attention_mask': torch.Size([8, 70])}
- To make sure everything will go smoothly during training, we pass a batch to the model:
model = AutoModelForSequenceClassification.from_pretrained(mdl_ckp, num_labels=2)
output = model(**batch)
output.loss, output.logits.shape
Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.weight', 'classifier.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
(tensor(0.5792, grad_fn=<NllLossBackward0>), torch.Size([8, 2]))
- Everything looks fine. Before we build the training loop we just need two more things:
  - an optimizer
  - a learning-rate scheduler
- Since we want to replicate the `Trainer` and its default parameters, we use the `AdamW` optimizer, which is Adam with a fix for how weight decay is applied (the same optimizer the `Trainer` uses by default):
optimizer = AdamW(model.parameters(), lr=5e-5)
/usr/local/lib/python3.10/dist-packages/transformers/optimization.py:411: FutureWarning: This implementation of AdamW is deprecated and will be removed in a future version. Use the PyTorch implementation torch.optim.AdamW instead, or set `no_deprecation_warning=True` to disable this warning
warnings.warn(
- Finally, the learning-rate scheduler used by the `Trainer` is a simple linear decay from the maximum learning rate (`5e-5`) down to `0`.
- To define it, we just need to know the number of training steps:
from transformers import get_scheduler

num_epochs = 3
num_training_steps = num_epochs * len(train_dataloader)
lr_scheduler = get_scheduler(
    'linear',
    optimizer=optimizer,
    num_warmup_steps=0,
    num_training_steps=num_training_steps
)
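As a quick sanity check (a sketch, not part of the original run), we can confirm the step count and the scheduler's starting learning rate; with a batch size of 8 over 3 epochs, the count matches the 1377 steps the `Trainer` reported earlier:

```python
# Sanity check: total number of optimization steps and the starting learning rate
print(num_training_steps)          # 1377 (3 epochs x 459 batches of size 8)
print(lr_scheduler.get_last_lr())  # [5e-05], decays linearly to 0 over training
```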
- As we mentioned before, the `Trainer` works out of the box on whatever hardware we have, but here we have to place the model on the GPU ourselves:
device = torch.device('cuda') if torch.cuda.is_available() else torch.device('cpu')
model.to(device)
BertForSequenceClassification(
(bert): BertModel(
(embeddings): BertEmbeddings(
(word_embeddings): Embedding(30522, 768, padding_idx=0)
(position_embeddings): Embedding(512, 768)
(token_type_embeddings): Embedding(2, 768)
(LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
(dropout): Dropout(p=0.1, inplace=False)
)
(encoder): BertEncoder(
(layer): ModuleList(
(0-11): 12 x BertLayer(
(attention): BertAttention(
(self): BertSelfAttention(
(query): Linear(in_features=768, out_features=768, bias=True)
(key): Linear(in_features=768, out_features=768, bias=True)
(value): Linear(in_features=768, out_features=768, bias=True)
(dropout): Dropout(p=0.1, inplace=False)
)
(output): BertSelfOutput(
(dense): Linear(in_features=768, out_features=768, bias=True)
(LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
(dropout): Dropout(p=0.1, inplace=False)
)
)
(intermediate): BertIntermediate(
(dense): Linear(in_features=768, out_features=3072, bias=True)
(intermediate_act_fn): GELUActivation()
)
(output): BertOutput(
(dense): Linear(in_features=3072, out_features=768, bias=True)
(LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
(dropout): Dropout(p=0.1, inplace=False)
)
)
)
)
(pooler): BertPooler(
(dense): Linear(in_features=768, out_features=768, bias=True)
(activation): Tanh()
)
)
(dropout): Dropout(p=0.1, inplace=False)
(classifier): Linear(in_features=768, out_features=2, bias=True)
)
Training Loop:
- Now we have all the ingredients to start our training loop:
from tqdm.auto import tqdm

progress_bar = tqdm(range(num_training_steps))

model.train()
for epoch in range(num_epochs):
    for batch in train_dataloader:
        # move the batch to the same device as the model
        batch = {k: v.to(device) for k, v in batch.items()}
        outputs = model(**batch)
        loss = outputs.loss
        loss.backward()

        optimizer.step()
        lr_scheduler.step()
        optimizer.zero_grad()
        progress_bar.update(1)
Evaluation Loop:
- The evaluation loop is basically the same idea as the `compute_metrics()` function we built before:
metric = evaluate.load('glue', 'mrpc')

model.eval()
for batch in eval_dataloader:
    batch = {k: v.to(device) for k, v in batch.items()}
    with torch.no_grad():
        outputs = model(**batch)

    logits = outputs.logits
    predictions = torch.argmax(logits, dim=-1)
    metric.add_batch(predictions=predictions, references=batch['labels'])

metric.compute()
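The introduction also mentioned the HuggingFace Accelerate library. As a hedged sketch (not run here, and omitting the data preparation above), the custom loop needs only a few changes to run on any distributed setup: let `Accelerator.prepare()` wrap the dataloaders, model and optimizer, drop the manual `.to(device)` calls, and replace `loss.backward()` with `accelerator.backward(loss)`:

```python
from accelerate import Accelerator

accelerator = Accelerator()

# Accelerator adapts the dataloaders, model and optimizer to the current hardware setup
train_dataloader, eval_dataloader, model, optimizer = accelerator.prepare(
    train_dataloader, eval_dataloader, model, optimizer
)
# note: in a multi-GPU run the dataloader is sharded, so num_training_steps
# (and hence the scheduler) may need to be recomputed after prepare()

model.train()
for epoch in range(num_epochs):
    for batch in train_dataloader:
        outputs = model(**batch)            # no manual batch.to(device) needed
        accelerator.backward(outputs.loss)  # replaces loss.backward()

        optimizer.step()
        lr_scheduler.step()
        optimizer.zero_grad()
```

With this structure, the same script can be launched on a single GPU, several GPUs, or TPUs via the `accelerate launch` command.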
To recap, in this chapter we:
- Learned about datasets on the Hub
- Learned how to load and preprocess datasets, including using dynamic padding and collators
- Implemented our own fine-tuning and evaluation of a model
- Implemented a lower-level training loop