Hugging Face Course Notes: Chapter 2

Hugging-Face
NLP
LLMs
PyTorch
Author

Ismail TG

Published

October 19, 2023

Introduction:

  • Transformer models are very large, with millions to tens of billions of parameters, which makes training, fine-tuning, and deploying them hard.
  • This is the problem the Hugging Face Transformers library addresses: its goal is to provide a single API through which any Transformer model can be loaded, trained, and saved.
  • With the Transformers library we can:
    • Download, load, and use models for inference or fine-tuning with just a couple of lines of code.
    • All models in the library are stored the same way; at their core they are just simple PyTorch nn.Module classes.
    • All components of a model are defined in a single file, so there are no abstractions or modules shared across files.
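  • As a tiny illustration (a minimal sketch, using the distilbert-base-uncased checkpoint purely as an example), a loaded model really is an ordinary PyTorch module:
import torch
from transformers import AutoModel

model = AutoModel.from_pretrained("distilbert-base-uncased")
isinstance(model, torch.nn.Module)   # True: it can be inspected, moved to a device, etc. like any nn.Module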

Behind the PipeLine:

  • To understand what is happening behind the scenes, we must first start with what we already know: the pipeline function.
from transformers import pipeline
classifier = pipeline('sentiment-analysis')
classifier(['My birthday is today!'])
No model was supplied, defaulted to distilbert-base-uncased-finetuned-sst-2-english and revision af0f99b (https://huggingface.co/distilbert-base-uncased-finetuned-sst-2-english).
Using a pipeline without specifying a model name and revision in production is not recommended.
[{'label': 'POSITIVE', 'score': 0.9996117949485779}]
  • As we saw in the previous chapter, the pipeline groups three steps in order to perform such a task: preprocessing with a tokenizer, passing the inputs through the model, and post-processing.
    (figure: the full pipeline, from tokenizer to model to post-processing)

Preprocessing with a Tokenizer:

  • In order to convert raw text to its numerical form before we feed it to the model, we use a tokenizer.
  • Here is how we tokenize any input text:
from transformers import AutoTokenizer
mdl_ckpt = "distilbert-base-uncased-finetuned-sst-2-english"
tokenizer = AutoTokenizer.from_pretrained(mdl_ckpt)
inputs = 'My birthday is today!'
outputs = tokenizer(inputs, padding=True, truncation=True, return_tensors='pt')
outputs
{'input_ids': tensor([[ 101, 2026, 5798, 2003, 2651,  999,  102]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1]])}
  • First we pick a model, distilbert-base-uncased-finetuned-sst-2-english, which is the same checkpoint the pipeline above used to classify the sentence.
  • We use AutoTokenizer to get the tokenization method that matches that model, because each model has its own way of tokenizing text.
  • Then we feed the text to the tokenizer and pick which type of tensors we want returned:
    • pt stands for PyTorch
    • the other parameters will be covered later
  • We get a dictionary with 2 keys: input_ids and attention_mask.
  • attention_mask will be covered later; input_ids contains one list of integers per sentence.
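  • As a quick sanity check (a minimal sketch reusing the tokenizer and outputs from above), we can map the IDs back to their tokens; the two extra IDs at the ends are special tokens, covered later:
tokenizer.convert_ids_to_tokens(outputs['input_ids'][0].tolist())
# expected: ['[CLS]', 'my', 'birthday', 'is', 'today', '!', '[SEP]']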

Going through the model:

  • We can download the pretrained model the same way we did with the tokenizer, using the AutoModel class, which also has a from_pretrained method.
  • We just need to use the same checkpoint as in the tokenization step.
from transformers import AutoModel
model = AutoModel.from_pretrained(mdl_ckpt)
  • The architecture we just downloaded contains only the base Transformer module: given some inputs, it outputs what we call hidden states.
  • For each model input we retrieve a high-dimensional vector representing the model's contextual understanding of that input.
  • These hidden states can be used as they are, but they are usually fed as input to another part of the model called the head.
  • Each head is task-specific.

A high-dimensional vector?

  • The output of the Transformer module is usually a large tensor with 3 dimensions:
    • Batch size: the number of sequences processed (1 in our case, since we passed only one sentence)
    • Sequence length: the length of the numerical representation of the sequence (7 in our example)
    • Hidden size: the vector dimension of each model input.
  • The high dimensionality comes from the last dimension: the hidden size is typically 768 for base models and can be much larger in bigger models:
outs = model(**outputs)
outs.last_hidden_state.shape
torch.Size([1, 7, 768])

Model heads: Making sense out of numbers:

  • To wrap up the process so far: the raw inputs are first converted into input IDs, then the embedding layer converts those IDs into vectors.
  • The subsequent layers manipulate these vectors using the attention mechanism to produce a contextual understanding of the input, in the form of a high-dimensional vector.

    (figure: a Transformer network alongside its head)
  • There are many architectures available in the Transformers library, each designed to tackle a specific task.
  • For example, if we want a model with a sequence classification head, we use AutoModelForSequenceClassification instead of AutoModel.
text = ['do you feel any better today?', 'I feel warm and cosy in my house']
tokenizer = AutoTokenizer.from_pretrained(mdl_ckpt)
inps = tokenizer(text, padding=True, truncation=True, return_tensors='pt')
from transformers import AutoModelForSequenceClassification
mdl_ckpt = "distilbert-base-uncased-finetuned-sst-2-english"
model = AutoModelForSequenceClassification.from_pretrained(mdl_ckpt)
outs = model(**inps)
outs
SequenceClassifierOutput(loss=None, logits=tensor([[-0.2121,  0.4987],
        [-3.9382,  4.1996]], grad_fn=<AddmmBackward0>), hidden_states=None, attentions=None)
outs.logits.shape
torch.Size([2, 2])
  • In this case we have 2 sentences and 2 labels (negative and positive).
  • The head takes the high-dimensional vectors as input and outputs a vector whose size matches our task: here 2, one logit per label.

Post processing:

  • The values we get don't make sense on their own, so we need to turn them into something meaningful for our task:
outs.logits
tensor([[-0.2121,  0.4987],
        [-3.9382,  4.1996]], grad_fn=<AddmmBackward0>)
  • These are the predictions for each sentence. Each prediction can be mapped to a label, so we need to know which label corresponds to which position, and convert the logits into meaningful values.
  • To convert the logits into probabilities we pass them through a softmax layer:
import torch
preds = torch.nn.functional.softmax(outs.logits, dim=-1)
preds
tensor([[3.2942e-01, 6.7058e-01],
        [2.9218e-04, 9.9971e-01]], grad_fn=<SoftmaxBackward0>)
  • Now we need to know the label of each column:
model.config.id2label
{0: 'NEGATIVE', 1: 'POSITIVE'}
  • So position [0] is NEGATIVE and position [1] is POSITIVE.
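  • Putting the two together, a minimal sketch that maps each probability row to its predicted label:
pred_labels = [model.config.id2label[i] for i in preds.argmax(dim=-1).tolist()]
pred_labels
# ['POSITIVE', 'POSITIVE'] for our two sentences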

Models

  • As we saw before, the AutoModel class is a handy tool to instantiate a model from a checkpoint (weights).
  • It can guess the corresponding architecture for the checkpoint.
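  • For example (a small sketch), loading a BERT checkpoint through AutoModel gives back the matching concrete class:
from transformers import AutoModel
mdl = AutoModel.from_pretrained('bert-base-cased')
type(mdl).__name__   # 'BertModel': the architecture was inferred from the checkpoint's configuration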

Building the transformer:

  • We can also call the model's class directly if we know exactly which model we want to use:
from transformers import BertConfig, BertModel
cnfg = BertConfig()
mdl = BertModel(cnfg)
  • The configuration contains many attributes related to the architecture:
cnfg
BertConfig {
  "attention_probs_dropout_prob": 0.1,
  "classifier_dropout": null,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768,
  "initializer_range": 0.02,
  "intermediate_size": 3072,
  "layer_norm_eps": 1e-12,
  "max_position_embeddings": 512,
  "model_type": "bert",
  "num_attention_heads": 12,
  "num_hidden_layers": 12,
  "pad_token_id": 0,
  "position_embedding_type": "absolute",
  "transformers_version": "4.34.1",
  "type_vocab_size": 2,
  "use_cache": true,
  "vocab_size": 30522
}
  • We can recognize many of these attributes, such as:
    • hidden_act: the activation function (gelu)
    • hidden_size: the vector dimension of each input token
    • num_attention_heads, num_hidden_layers, model_type
  • While it is possible to build a model this way and use it, it will produce very poor predictions because its weights are initialized randomly.
  • That forces us to train it from scratch, which is a daunting, time-, money-, and energy-consuming process.
  • This is why it is preferable to load the model the other way: starting from a pretrained one:
from transformers import BertModel
mdl = BertModel.from_pretrained('bert-base-cased')
  • We could even use AutoModel instead of BertModel, since that produces checkpoint-agnostic code that fits all situations.

  • At this point the model is initialized with all the weights of the checkpoint; it can be used directly for inference on the tasks it was trained on, and it can also be fine-tuned on a new task or on more data.

Saving the model:

  • To save a model once we are satisfied with its performance:
mdl.save_pretrained('path')
!ls 'path'
config.json  pytorch_model.bin  special_tokens_map.json  tokenizer_config.json  vocab.txt
  • For the model itself, this saves two files (the tokenizer files in the listing above come from saving a tokenizer to the same path, shown later):
    • config.json: contains all attributes necessary to build the model architecture, plus some metadata
    • pytorch_model.bin: contains the learnable weights.
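  • The saved directory can then be reloaded exactly like a Hub checkpoint (a minimal sketch, assuming the same 'path' as above):
from transformers import AutoModel
reloaded = AutoModel.from_pretrained('path')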

Using a Transformer model for inference:

  • The tokenizer converts input text into input IDs:
sequences = ["Hello!", "Cool.", "Nice!"]
inps = tokenizer(sequences)
encoded_sequences = inps.input_ids
encoded_sequences
[[101, 7592, 999, 102], [101, 4658, 1012, 102], [101, 3835, 999, 102]]
  • The output we get here is a list of lists; the problem is that tensors only accept rectangular shapes.
  • So we need to convert it into a tensor of the right shape:
input = torch.tensor(encoded_sequences)
input
tensor([[ 101, 7592,  999,  102],
        [ 101, 4658, 1012,  102],
        [ 101, 3835,  999,  102]])

Using the tensors as inputs to the model

  • Making use of the returned tensor is as easy as passing it through the model:
outputs= mdl(input)
outputs
BaseModelOutputWithPoolingAndCrossAttentions(last_hidden_state=tensor([[[ 4.4496e-01,  4.8276e-01,  2.7797e-01,  ..., -5.4032e-02,
           3.9394e-01, -9.4770e-02],
         [ 2.4943e-01, -4.4093e-01,  8.1772e-01,  ..., -3.1917e-01,
           2.2992e-01, -4.1172e-02],
         [ 1.3668e-01,  2.2518e-01,  1.4502e-01,  ..., -4.6915e-02,
           2.8224e-01,  7.5566e-02],
         [ 1.1789e+00,  1.6738e-01, -1.8187e-01,  ...,  2.4671e-01,
           1.0441e+00, -6.1970e-03]],

        [[ 3.6436e-01,  3.2464e-02,  2.0258e-01,  ...,  6.0111e-02,
           3.2451e-01, -2.0995e-02],
         [ 7.1866e-01, -4.8725e-01,  5.1740e-01,  ..., -4.4012e-01,
           1.4553e-01, -3.7545e-02],
         [ 3.3223e-01, -2.3271e-01,  9.4877e-02,  ..., -2.5268e-01,
           3.2172e-01,  8.1079e-04],
         [ 1.2523e+00,  3.5754e-01, -5.1320e-02,  ..., -3.7840e-01,
           1.0526e+00, -5.6255e-01]],

        [[ 2.4042e-01,  1.4718e-01,  1.2110e-01,  ...,  7.6062e-02,
           3.3564e-01,  2.8262e-01],
         [ 6.5701e-01, -3.2787e-01,  2.4968e-01,  ..., -2.5920e-01,
           2.0175e-01,  3.3275e-01],
         [ 2.0160e-01,  1.5783e-01,  9.8974e-03,  ..., -3.8850e-01,
           4.1308e-01,  3.9732e-01],
         [ 1.0175e+00,  6.4387e-01, -7.8147e-01,  ..., -4.2109e-01,
           1.0925e+00, -4.8456e-02]]], grad_fn=<NativeLayerNormBackward0>), pooler_output=tensor([[-0.6856,  0.5262,  1.0000,  ...,  1.0000, -0.6112,  0.9971],
        [-0.6055,  0.4997,  0.9998,  ...,  0.9999, -0.6753,  0.9769],
        [-0.7702,  0.5447,  0.9999,  ...,  1.0000, -0.4655,  0.9894]],
       grad_fn=<TanhBackward0>), hidden_states=None, past_key_values=None, attentions=None, cross_attentions=None)

Tokenization:

  • One of the core components of the NLP pipeline is tokenization.
  • It serves the purpose of translating raw text into a numerical form the model can process.
  • In general there are three approaches to tokenizing raw text (we won't go into much detail):
    • Word-based: each word gets its own token
    • Character-based: splits text into characters and gives each of them a unique token
    • Sub-word tokenization: keeps frequent words as single tokens, while splitting rare words into sub-word pieces and tokenizing those.
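  • A rough illustration of the three approaches (a minimal sketch; the exact sub-word pieces depend on the tokenizer's vocabulary):
from transformers import AutoTokenizer

text = 'Tokenization is interesting'
word_tokens = text.split()            # word-based: one token per whitespace-separated word
char_tokens = list(text)              # character-based: one token per character
wp = AutoTokenizer.from_pretrained('bert-base-cased')
subword_tokens = wp.tokenize(text)    # sub-word (WordPiece): rare words are split into '##'-prefixed pieces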

Loading and Saving:

  • Loading and saving a tokenizer is simple and very similar to how we load and save models.
  • It uses the same two methods: from_pretrained and save_pretrained.
  • We can load the tokenizer either by calling its concrete class, BertTokenizer, or by just using AutoTokenizer, the same way we loaded models with BertModel or AutoModel:
from transformers import AutoTokenizer, BertTokenizer
tokenizer = BertTokenizer.from_pretrained('bert-base-cased')
# or the easy way
tokenizer1 = AutoTokenizer.from_pretrained('bert-base-cased')
text = 'try to tokenize this text so I can see the difference between them!'
inp = tokenizer(text, return_tensors='pt')
inp1 = tokenizer1(text, return_tensors='pt')
inp.input_ids, inp1.input_ids
(tensor([[  101,  2222,  1106, 22559,  3708,  1142,  3087,  1177,   146,  1169,
           1267,  1103,  3719,  1206,  1172,   106,   102]]),
 tensor([[  101,  2222,  1106, 22559,  3708,  1142,  3087,  1177,   146,  1169,
           1267,  1103,  3719,  1206,  1172,   106,   102]]))
  • Both methods yield the same result, but as we said before we prefer the second one since it is checkpoint-agnostic and applies to any model.

  • Saving a tokenizer is also similar to how we save models:

tokenizer.save_pretrained('path')
('path/tokenizer_config.json',
 'path/special_tokens_map.json',
 'path/vocab.txt',
 'path/added_tokens.json')
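  • And like the model, the saved tokenizer can be reloaded from that same directory (a small sketch, assuming the 'path' used above):
tokenizer_reloaded = AutoTokenizer.from_pretrained('path')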

Encoding:

  • The process of translating text into numbers is called encoding.
  • Encoding is done in two steps:
    • tokenization
    • conversion to input IDs

Tokenizer:

  • First the tokenizer creates tokens: mostly complete words, but in some cases a single word is split into two or more parts.
    • Those sub-word parts can be identified by the ## prefix.
  • Then we need to convert the tokens into input IDs in order to feed them to the model.
  • To do that, the tokenizer maps each token through a vocabulary.
    • When we instantiate the tokenizer with from_pretrained() we download that vocabulary, so each token can be matched against an ID.
    • We need to use the same checkpoint as the one used during training.
  • Here we will explore these two steps separately:

text  = 'Using a transformer network is simple'
tokenizer = AutoTokenizer.from_pretrained('bert-base-cased')
tokens = tokenizer.tokenize(text)
tokens
['Using', 'a', 'transform', '##er', 'network', 'is', 'simple']
  • As we can see, the word transformer gets split into two parts; the second one carries the ## prefix.
ids = tokenizer.convert_tokens_to_ids(tokens)
ids
[7993, 170, 11303, 1200, 2443, 1110, 3014]

Decoding:

  • Decoding is the exact opposite of encoding: it's the process of converting IDs back to their text form.
decoded_ids = tokenizer.decode(ids)
decoded_ids
'Using a transformer network is simple'
  • We get back the text we began with by using the decode() method.

Handling multiple sequences:

  • Performing inference on a single sequence is simple, but what if we want to process multiple sequences with different lengths?
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification
# model checkpoint
mdl_chkpt = 'distilbert-base-uncased-finetuned-sst-2-english'
# tokenizer
tokenizer = AutoTokenizer.from_pretrained(mdl_chkpt)
# model
model = AutoModelForSequenceClassification.from_pretrained(mdl_chkpt)
# sequence
text = "I've been waiting for a HuggingFace course my whole life."
#tokens
tokens = tokenizer.tokenize(text)
# ids
ids = tokenizer.convert_tokens_to_ids(tokens)
# convert them into tensor
input_ids = torch.tensor(ids)
  • Now we can pass the tensor to the model to get some predictions:
preds = model(input_ids)
---------------------------------------------------------------------------
IndexError                                Traceback (most recent call last)
<ipython-input-68-9aae2e9c06e9> in <cell line: 1>()
----> 1 preds = model(input_ids)

/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py in _wrapped_call_impl(self, *args, **kwargs)
   1516             return self._compiled_call_impl(*args, **kwargs)  # type: ignore[misc]
   1517         else:
-> 1518             return self._call_impl(*args, **kwargs)
   1519 
   1520     def _call_impl(self, *args, **kwargs):

/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py in _call_impl(self, *args, **kwargs)
   1525                 or _global_backward_pre_hooks or _global_backward_hooks
   1526                 or _global_forward_hooks or _global_forward_pre_hooks):
-> 1527             return forward_call(*args, **kwargs)
   1528 
   1529         try:

/usr/local/lib/python3.10/dist-packages/transformers/models/distilbert/modeling_distilbert.py in forward(self, input_ids, attention_mask, head_mask, inputs_embeds, labels, output_attentions, output_hidden_states, return_dict)
    787         return_dict = return_dict if return_dict is not None else self.config.use_return_dict
    788 
--> 789         distilbert_output = self.distilbert(
    790             input_ids=input_ids,
    791             attention_mask=attention_mask,

/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py in _wrapped_call_impl(self, *args, **kwargs)
   1516             return self._compiled_call_impl(*args, **kwargs)  # type: ignore[misc]
   1517         else:
-> 1518             return self._call_impl(*args, **kwargs)
   1519 
   1520     def _call_impl(self, *args, **kwargs):

/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py in _call_impl(self, *args, **kwargs)
   1525                 or _global_backward_pre_hooks or _global_backward_hooks
   1526                 or _global_forward_hooks or _global_forward_pre_hooks):
-> 1527             return forward_call(*args, **kwargs)
   1528 
   1529         try:

/usr/local/lib/python3.10/dist-packages/transformers/models/distilbert/modeling_distilbert.py in forward(self, input_ids, attention_mask, head_mask, inputs_embeds, output_attentions, output_hidden_states, return_dict)
    590             raise ValueError("You cannot specify both input_ids and inputs_embeds at the same time")
    591         elif input_ids is not None:
--> 592             self.warn_if_padding_and_no_attention_mask(input_ids, attention_mask)
    593             input_shape = input_ids.size()
    594         elif inputs_embeds is not None:

/usr/local/lib/python3.10/dist-packages/transformers/modeling_utils.py in warn_if_padding_and_no_attention_mask(self, input_ids, attention_mask)
   3939 
   3940         # Check only the first and last input IDs to reduce overhead.
-> 3941         if self.config.pad_token_id in input_ids[:, [-1, 0]]:
   3942             warn_string = (
   3943                 "We strongly recommend passing in an `attention_mask` since your input_ids may be padded. See "

IndexError: too many indices for tensor of dimension 1
  • We did the same steps as before, so why do we get an error?
  • If we look again at what the tokenizer outputs, we see that it doesn't just return the IDs as a tensor: it adds a batch dimension on top of them:
seq = 'this is just a simple example to see the difference between the 2!'
normal_input = tokenizer(seq, return_tensors='pt')
the_tokens = tokenizer.tokenize(seq)
ids_1 = tokenizer.convert_tokens_to_ids(the_tokens)
inp_ids = torch.tensor(ids_1)
print(f'this is normal way: {normal_input.input_ids}')
print(f'this is the manual way: {inp_ids}')
this is normal way: tensor([[ 101, 2023, 2003, 2074, 1037, 3722, 2742, 2000, 2156, 1996, 4489, 2090,
         1996, 1016,  999,  102]])
this is the manual way: tensor([2023, 2003, 2074, 1037, 3722, 2742, 2000, 2156, 1996, 4489, 2090, 1996,
        1016,  999])
  • The normal way has one more dimension than the manual way, so we need to add that dimension when converting to a tensor before feeding it to the model:
input_ids= torch.tensor([ids])
input_ids
tensor([[ 1045,  1005,  2310,  2042,  3403,  2005,  1037, 17662, 12172,  2607,
          2026,  2878,  2166,  1012]])
  • Now let’s try again with the model:
preds = model(input_ids)
preds.logits
tensor([[-2.7276,  2.8789]], grad_fn=<AddmmBackward0>)
  • The thing to keep in mind here is the importance of batching when feeding sentences to the model, even if we only have a single sentence:
batched_ids = [ids, ids]
batched_inp_ids = torch.tensor(batched_ids)
outs = model(batched_inp_ids).logits
outs
tensor([[-2.7276,  2.8789],
        [-2.7276,  2.8789]], grad_fn=<AddmmBackward0>)
  • Batching allows the model to process multiple sentences at once.
  • But what if these sentences have different lengths, which is almost always the case?
  • As we know, tensors need to be rectangular in order to convert a list of lists of IDs into a tensor.
  • To solve this kind of problem we will go through a couple of techniques to work around it:

Padding:

  • Suppose we have this list of lists as IDs:
batched_ids = [[200, 200], [200, 200, 100]]
  • We need to get this list of lists into a rectangular shape before we convert it into a tensor.
  • This is where we use padding:
padding_id = 100
batched_ids = [[200, 200, padding_id], [200, 200, 100]]
  • The real padding token ID can be found in tokenizer.pad_token_id.
  • Now let's do a simple experiment to compare three forward passes through the model:
    • the first sequence alone
    • the second sequence alone
    • both sequences batched together, with padding applied
model = AutoModelForSequenceClassification.from_pretrained(mdl_chkpt)
sequence1_ids = [[200, 200]]
sequence2_ids = [[200, 200, 200]]
batched_ids = [[200, 200,tokenizer.pad_token_id ], [200, 200, 200]]
print(model(torch.tensor(sequence1_ids)).logits)
print(model(torch.tensor(sequence2_ids)).logits)
print(model(torch.tensor(batched_ids)).logits)
tensor([[ 0.5803, -0.4125]], grad_fn=<AddmmBackward0>)
tensor([[ 1.5694, -1.3895]], grad_fn=<AddmmBackward0>)
tensor([[ 1.3374, -1.2163],
        [ 1.5694, -1.3895]], grad_fn=<AddmmBackward0>)
  • Notice that the logits for the second sequence match its row in the batched output, but the first ones don't.
  • The first sequence is the one we padded, and Transformer models attend to the full context of a sequence (here, the context of the IDs), so the padding token we added to get a rectangular shape is also taken into account, which changes the final prediction.
  • We need to tell the model to ignore these padding values during the computation.

Attention mask:

  • The attention mask is what tells the model, at prediction time, to ignore the padding values and not include them when computing attention:
batched_ids = [
    [200, 200, 200],
    [200, 200, tokenizer.pad_token_id],
]

attention_mask = [
    [1, 1, 1],
    [1, 1, 0],
]

outputs = model(torch.tensor(batched_ids), attention_mask=torch.tensor(attention_mask))
print(outputs.logits)
tensor([[ 1.5694, -1.3895],
        [ 0.5803, -0.4125]], grad_fn=<AddmmBackward0>)
  • Now we get the same logits for the padded sequence as when we passed it alone.

Longer sequences:

  • Transformer models cannot handle arbitrarily long sequences; most have a maximum length of 512 or 1024 tokens.
  • If we have to deal with longer sequences we can either:
    • use a model designed to handle long sequences
    • truncate the sequences
  • Truncation simply cuts off sequences that exceed a maximum length, either the model's maximum length or an explicit max_length we specify (see the sketch below).
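  • A minimal sketch of both truncation options, reusing the DistilBERT tokenizer from above (512 is its model maximum):
long_text = 'to be or not to be ' * 200                          # far more than 512 tokens
capped = tokenizer(long_text, truncation=True)                   # cut at the model max length
short = tokenizer(long_text, truncation=True, max_length=32)     # cut at an explicit max_length
len(capped.input_ids), len(short.input_ids)                      # (512, 32)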

Putting All Together:

  • What we have done so far is hand-coding each step of the tokenization process without the full help of the tokenizer.
  • However, as we saw at the start, the Transformers API can handle all of this work through high-level functions:
mdl_ckpt = "distilbert-base-uncased-finetuned-sst-2-english"
tokenizer = AutoTokenizer.from_pretrained(mdl_ckpt)
seq = "I've been waiting for a HuggingFace course my whole life."
  • We can call the tokenizer function directly on a sequence and get:
inputs = tokenizer(seq)
inputs
{'input_ids': [101, 1045, 1005, 2310, 2042, 3403, 2005, 1037, 17662, 12172, 2607, 2026, 2878, 2166, 1012, 102], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}
  • We get the input IDs, and the attention_mask is already included.
  • We can also specify more features for the tokenizer to apply:
# pass multiple sentences:
seqs = ['this is sentence number one', 'this is the second']
inps = tokenizer(seqs)
inps
{'input_ids': [[101, 2023, 2003, 6251, 2193, 2028, 102], [101, 2023, 2003, 1996, 2117, 102]], 'attention_mask': [[1, 1, 1, 1, 1, 1, 1], [1, 1, 1, 1, 1, 1]]}
  • Pad according to different strategies:
# pad to the length of the longest sequence in the batch
inps = tokenizer(seqs, padding='longest')

# pad to the model max length
inps = tokenizer(seqs, padding='max_length')

# pad to a specified length
inps = tokenizer(seqs, padding='max_length', max_length=8)
# Will truncate the sequences that are longer than the model max length
# (512 for BERT or DistilBERT)
inps = tokenizer(seqs, truncation=True)
inps
{'input_ids': [[101, 2023, 2003, 6251, 2193, 2028, 102], [101, 2023, 2003, 1996, 2117, 102]], 'attention_mask': [[1, 1, 1, 1, 1, 1, 1], [1, 1, 1, 1, 1, 1]]}
# Will truncate the sequences that are longer than the specified max length
inps = tokenizer(seqs, max_length=7, truncation=True)
inps
{'input_ids': [[101, 2023, 2003, 6251, 2193, 2028, 102], [101, 2023, 2003, 1996, 2117, 102]], 'attention_mask': [[1, 1, 1, 1, 1, 1, 1], [1, 1, 1, 1, 1, 1]]}
  • The tokenizer object can also handle the conversion to framework-specific tensors, which can then be sent directly to the model (pt stands for PyTorch).
  • When batching sequences of different lengths this way, padding should always be enabled (padding=True) so the tensor is rectangular:
inps = tokenizer(seqs, return_tensors='pt', padding=True)
inps
{'input_ids': tensor([[ 101, 2023, 2003, 6251, 2193, 2028,  102],
        [ 101, 2023, 2003, 1996, 2117,  102,    0]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1],
        [1, 1, 1, 1, 1, 1, 0]])}

Special Tokens

  • If we look closely at the input IDs we get, we can spot a small difference from what we got earlier.
  • The tokenizer added two IDs to the list, one at the beginning and one at the end.
    • They always have the same values: 101 and 102.
seq = "I've been waiting for a HuggingFace course my whole life."
toks1 = tokenizer(seq)
toks2 = tokenizer.tokenize(seq)
ids1= toks1.input_ids
ids2 = tokenizer.convert_tokens_to_ids(toks2)
print(f'normal: {ids1}')
print(f'hard_coded: {ids2}')
normal: [101, 1045, 1005, 2310, 2042, 3403, 2005, 1037, 17662, 12172, 2607, 2026, 2878, 2166, 1012, 102]
hard_coded: [1045, 1005, 2310, 2042, 3403, 2005, 1037, 17662, 12172, 2607, 2026, 2878, 2166, 1012]
  • If we now decode the two sets of IDs, we get different sentences back:
print(f'normal: {tokenizer.decode(ids1)}')
print(f'hard-coded: {tokenizer.decode(ids2)}')
normal: [CLS] i've been waiting for a huggingface course my whole life. [SEP]
hard-coded: i've been waiting for a huggingface course my whole life.
  • The tokenizer added the special tokens [CLS] and [SEP] to the sentence because the model was pretrained with them, so it expects them at inference time.
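  • If we ever need the plain text back without them, decode() accepts a skip_special_tokens flag:
tokenizer.decode(ids1, skip_special_tokens=True)
# "i've been waiting for a huggingface course my whole life."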

Wrapping Up: From Tokenizer to Model:

import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

checkpoint = "distilbert-base-uncased-finetuned-sst-2-english"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSequenceClassification.from_pretrained(checkpoint)
sequences = ["I've been waiting for a HuggingFace course my whole life.", "So have I!"]

tokens = tokenizer(sequences, padding=True, truncation=True, return_tensors="pt")
output = model(**tokens)
outs = output.logits
outs
tensor([[-1.5607,  1.6123],
        [-3.6183,  3.9137]], grad_fn=<AddmmBackward0>)
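  • To close the loop with the post-processing step from earlier, a minimal sketch that turns these logits into labeled predictions:
probs = torch.nn.functional.softmax(outs, dim=-1)
[model.config.id2label[i] for i in probs.argmax(dim=-1).tolist()]
# ['POSITIVE', 'POSITIVE']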