Chapter 5: Deep Learning for Coders with fastai and PyTorch

Fastai
Pytorch
Numpy
Pandas
Deep Learning
Author

Ismail TG

Published

October 26, 2022

Image Classification

  • Since we are now familiar with the whole process of creating deep learning models, using pre-built models, building them from scratch, handling data, and putting these models into web apps, we will now go deeper and focus on the details that make a model accurate and reliable.
  • It takes many tweaks and parameter changes to “polish” a model.
  • To achieve this goal we need to be familiar with many concepts and techniques: different types of layers, regularization methods, optimizers, how to put layers together into architectures, labeling techniques, and much more.
! [ -e /content ] && pip install -Uqq fastbook
import fastbook
fastbook.setup_book()
Mounted at /content/gdrive
from fastbook import *
from fastai.vision.widgets import *

From Dogs and Cats to Pet Breeds

  • In real-world scenarios, the first thing we do is get in contact with the data; usually at this stage we know nothing about the dataset. We then start to look at how to extract the data we want, what the data looks like, and how it is structured.

  • Usually data is provided in one of two ways:

    • Individual files representing items of data, possibly organized into folders or with filenames representing information about those items
      • text documents
      • images
    • A table of data in which each row is an item and may include filenames providing connections between the data in the table and data in other formats
      • CSV files
  • Exceptions:

    • Domains like Genomics
    • binary database formats
    • network streams
# download the dataset
from fastai.vision.all import *
path = untar_data(URLs.PETS)
100.00% [811712512/811706944 01:02<00:00]
# get the path as a variable, and see what's inside
Path.BASE_PATH = path
path.ls()
(#2) [Path('images'),Path('annotations')]
  • As we can see, the data comes in 2 directories:
    • images
    • annotations
# take a look at what's inside the images directory
(path/'images').ls()
(#7393) [Path('images/german_shorthaired_87.jpg'),Path('images/Russian_Blue_200.jpg'),Path('images/Siamese_25.jpg'),Path('images/japanese_chin_48.jpg'),Path('images/miniature_pinscher_161.jpg'),Path('images/pug_150.jpg'),Path('images/pug_120.jpg'),Path('images/pug_63.jpg'),Path('images/samoyed_143.jpg'),Path('images/yorkshire_terrier_190.jpg')...]
  • Looking at these filenames, we notice some patterns: we already know from Chapter 1 that cat breed names start with an uppercase letter, and here we see that after the breed’s name there is an underscore (_), then a number, and finally the extension.
  • This can help us write some code that extracts the breed from a single Path.
#pick one 
fname = (path/"images").ls()[0]
fname
Path('images/german_shorthaired_87.jpg')
  • The best way to work with strings and extract patterns from them is to use a regex, which stands for regular expression.
re.findall(r'(.+)_\d+.jpg$', fname.name)
['german_shorthaired']
  • Now we need to label the whole dataset using this pattern.
  • Fastai comes with many classes for labeling; since we need to label with the help of a regex, we can use the RegexLabeller class within the DataBlock API.
pets = DataBlock(blocks = (ImageBlock, CategoryBlock),
                 get_items=get_image_files, 
                 splitter=RandomSplitter(seed=42),
                 get_y=using_attr(RegexLabeller(r'(.+)_\d+.jpg$'), 'name'),
                 item_tfms=Resize(460),
                 batch_tfms=aug_transforms(size=224, min_scale=0.75))
dls = pets.dataloaders(path/"images")

Presizing

  • Fastai has a presizing strategy that resizes the images in a way that preserves their quality through data augmentation; it helps the model learn more from the data and makes the augmented dataset more varied.
  • The idea behind presizing is that we first crop and resize each image to 460 by 460, which is big by deep learning norms; this operation is done on the CPU. Then we do the data augmentation in batches, cropping a rotated random part of that 460×460 image and resizing the crop to 224 by 224; these operations are done at the batch level, which means on the GPU.
item_tfms=Resize(460),  
batch_tfms=aug_transforms(size=224, min_scale=0.75))

![Resize_method](1.png)

  • Usually the augmentation operations we apply to an image reduce its quality, but with this approach we preserve a large part of the image’s information, so the model can learn better.

Checking and Debugging a DataBlock

  • DataBlock is just a blueprint for organizing data before we feed it to the model; there is no guarantee that your template will work on your data source as you intend.
  • So, before training a model you should always check your data. You can do this using the show_batch method:

dls.show_batch(nrows=1, ncols=4, unique=True)

  • In case we made a mistake in the process of creating the DataBlock, we can use .summary to track down the problem; a sketch of such a mistake follows this list. Here the images were not resized to one common size, so the batch transform could not be applied.

  • As we see here, .summary gives us a precise diagnosis of the problem:

    • at least two tensors in the batch are not the same size.
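  • A sketch of how such a mistake can look (reusing the path from above; pets_broken is just an illustrative name): a DataBlock defined without any item_tfms, so the images keep their different sizes and cannot be collated into a batch.
# same blueprint as before, but without Resize, so batching fails
pets_broken = DataBlock(blocks=(ImageBlock, CategoryBlock),
                        get_items=get_image_files,
                        splitter=RandomSplitter(seed=42),
                        get_y=using_attr(RegexLabeller(r'(.+)_\d+.jpg$'), 'name'))
pets_broken.summary(path/"images")  # the diagnostic ends by reporting tensors of different sizes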

Clean the data with the model!!

  • Once we feel the DataBlock is well constructed, we should start training the model and use it as a tool for cleaning the data. If there is a problem with the data or the model, it is better to know that early, before we spend a lot of time and energy on data cleaning without even having tested the model.
# train the model
learn = vision_learner(dls, resnet34, metrics=error_rate)
learn.fine_tune(2)
Downloading: "https://download.pytorch.org/models/resnet34-b627a593.pth" to /root/.cache/torch/hub/checkpoints/resnet34-b627a593.pth
epoch train_loss valid_loss error_rate time
0 1.526362 0.348364 0.111637 01:11
epoch train_loss valid_loss error_rate time
0 0.482150 0.309586 0.093369 01:16
1 0.327810 0.229392 0.064953 01:17

Use the model to clean the data

  • As we saw in Chapter 2, the best tool to clean the data is basically the model itself.

  • After creating the DataBlock and DataLoaders, we should train the model and get some feedback so we know early on if something is wrong; if not, we can start using the model as a tool to investigate the data.

  • Usually, before training a model we have to decide on the function that will drive the parameter updates: a loss function. But here we didn’t create any loss function.

  • If we don’t specify how the parameters should be evaluated and updated, fastai will choose a loss function for us by default.

    • the chosen loss function will suit the kind of model we build and the type of data we have.
# check the loss function
learn.loss_func
FlattenedLoss of CrossEntropyLoss()
  • The loss function used to train this model is Cross Entropy Loss

Cross-Entropy Loss

  • The Cross-Entropy Loss is a function similar to what we saw in the previous chapter, when we created the mnist_loss function:
def mnist_loss(predictions, targets):
    predictions = predictions.sigmoid()
    return torch.where(targets==1, 1-predictions, predictions).mean()
  • The problem with this function is that it only handles 2 categories (3s and 7s), but here we have 37 breeds, i.e. multiple classes.
    • cross-entropy loss can take more than 2 categories

Viewing activations and labels

  • In order to understand the cross-entropy loss, let’s grab a batch of data
x,y = dls.one_batch()
  • It returns the independent and dependent variables for one mini-batch.
# dependent variable (the labels)
y, len(y)
(TensorCategory([ 8, 34,  6, 33, 12, 15, 32, 14,  5, 27, 20, 32, 18, 24,  8, 29,  5,  4,  0, 28, 10, 12, 16, 25, 29,  3, 34, 27, 30, 15,  6, 15, 27, 34, 14, 21,  5, 17, 31, 26, 13, 35, 17, 35, 23, 14,
                 35, 35,  8,  7, 21,  0, 22, 17, 19, 26, 16,  5, 15, 27, 11, 22, 34, 18], device='cuda:0'),
 64)
  • It returns 64 numbers; each represents the index of one of the 37 breeds.
# all the possible categories (the vocab)
dls.vocab, len(dls.vocab)
(['Abyssinian', 'Bengal', 'Birman', 'Bombay', 'British_Shorthair', 'Egyptian_Mau', 'Maine_Coon', 'Persian', 'Ragdoll', 'Russian_Blue', 'Siamese', 'Sphynx', 'american_bulldog', 'american_pit_bull_terrier', 'basset_hound', 'beagle', 'boxer', 'chihuahua', 'english_cocker_spaniel', 'english_setter', 'german_shorthaired', 'great_pyrenees', 'havanese', 'japanese_chin', 'keeshond', 'leonberger', 'miniature_pinscher', 'newfoundland', 'pomeranian', 'pug', 'saint_bernard', 'samoyed', 'scottish_terrier', 'shiba_inu', 'staffordshire_bull_terrier', 'wheaten_terrier', 'yorkshire_terrier'],
 37)
  • Here we use get_preds to get the predictions for each image in the dataset or mini-batch (like in this example).
preds,_ = learn.get_preds(dl=[(x,y)])
# prediction for image [0] in the mini-batch
preds[0]
TensorBase([1.0919e-05, 6.0375e-08, 8.5680e-05, 7.1281e-07, 1.8754e-06, 4.7956e-07, 1.1708e-04, 1.4080e-06, 9.9972e-01, 9.4731e-07, 6.8318e-08, 1.7055e-07, 1.9338e-06, 7.7372e-07, 1.7931e-07,
            1.5261e-07, 1.1363e-08, 2.3353e-07, 1.3689e-07, 4.5361e-07, 7.1548e-08, 4.9425e-06, 1.0026e-05, 2.6908e-07, 3.5724e-07, 3.1201e-07, 1.0177e-08, 2.7219e-08, 1.6157e-06, 3.5337e-08,
            1.8094e-06, 3.3425e-05, 2.3514e-08, 3.0318e-06, 1.4910e-07, 1.9582e-07, 9.5147e-08])
  • The 37 predictions give the probability of each breed matching image [0].
    • if we sum() them up, they add up to 1:
preds[0].sum()
TensorBase(1.0000)
  • In order to transform the activations of our model into predictions like these, we use softmax.

SoftMax

  • As we said before, softmax is similar to the sigmoid function we used before, but it can handle more than 2 classes.
# reminder of sigmoid
plot_function(torch.sigmoid, min=-4,max=4)

  • The sigmoid lets us predict which of two categories an activation points to, by looking at which activation is bigger and by how much. But in our case we have 37 categories, which means by this logic we need an activation for each one.
  • First let’s create a similar situation where we have only 2 categories, but instead of treating it as a binary problem (“is it a 3?”), we treat it as a 2-category problem where each category has its own activation and the two probabilities sum to 1.
torch.random.manual_seed(42);
acts = torch.randn((6,2))*2
acts
tensor([[ 0.6734,  0.2576],
        [ 0.4689,  0.4607],
        [-2.2457, -0.3727],
        [ 4.4164, -1.2760],
        [ 0.9233,  0.5347],
        [ 1.0698,  1.6187]])
  • We can’t just take the sigmoid of this directly, since we don’t get rows that add to 1 (i.e., we want the probability of being a 3 plus the probability of being a 7 to add up to 1):
sigmoid(acts)
tensor([[0.6623, 0.5641],
        [0.6151, 0.6132],
        [0.0957, 0.4079],
        [0.9881, 0.2182],
        [0.7157, 0.6306],
        [0.7446, 0.8346]])
  • Even though we take a different approach to solve the same problem, we still have some similarities.
  • We still use a sigmoid, but on the difference between the two activations.
  • We subtract one activation from the other because that difference represents how sure the model is that the image belongs to the first category rather than the second; the sigmoid of that difference gives the first column.
  • For the second column we just use 1 minus the first column’s prediction.
diffs = acts[:, 0] - acts[:, 1]
# stack the two columns: sigmoid(diff) and 1 - sigmoid(diff)
sigm_ver = torch.stack([diffs.sigmoid(), 1-diffs.sigmoid()], dim=1)
  • We can express the softmax function as:
def softmax(x): return x.exp() / x.exp().sum(dim=1, keepdim=True)
  • Let’s check that softmax returns the same values as sigmoid for the first column, and those values subtracted from 1 for the second column:
sm_acts = torch.softmax(acts, dim=1)
sm_acts, sigm_ver
(tensor([[0.6025, 0.3975],
         [0.5021, 0.4979],
         [0.1332, 0.8668],
         [0.9966, 0.0034],
         [0.5959, 0.4041],
         [0.3661, 0.6339]]), tensor([[0.6025, 0.3975],
         [0.5021, 0.4979],
         [0.1332, 0.8668],
         [0.9966, 0.0034],
         [0.5959, 0.4041],
         [0.3661, 0.6339]]))
  • Softmax calculates \(e^{x}\) for each activation and divides it by the sum of \(e^{x}\) over all the activations in that row.
    • the exp makes sure the biggest activation ends up much bigger than the others
    • dividing by the sum is what makes the softmax values add up to 1
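  • In formula form, for one row of activations \(x\) with \(K\) categories:
\[ \mathrm{softmax}(x)_{i} = \frac{e^{x_{i}}}{\sum_{j=1}^{K} e^{x_{j}}} \]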

Log Likelihood

  • In the previous chapter, when we created mnist_loss, we used torch.where to select between the input and 1-input.
  • With softmax, we will use indexing instead.
# pretend these are the targets
targs = tensor([0, 1, 0, 1, 1, 0])
# create an index
idx = range(6)
# the softmax activations
sm_acts
tensor([[0.6025, 0.3975],
        [0.5021, 0.4979],
        [0.1332, 0.8668],
        [0.9966, 0.0034],
        [0.5959, 0.4041],
        [0.3661, 0.6339]])
  • Here we let targs decide which activation we pick in each row:
sm_acts[idx, targs]
tensor([0.6025, 0.4979, 0.1332, 0.0034, 0.4041, 0.3661])
  • Let’s display what we just did:
from IPython.display import HTML
df = pd.DataFrame(sm_acts, columns=["3","7"])
df['targs'] = targs
df['idx'] = idx
df['result'] = sm_acts[range(6), targs]
t = df.style.hide_index()
#To have html code compatible with our script
html = t._repr_html_().split('</style>')[1]
html = re.sub(r'<table id="([^"]+)"\s*>', r'<table >', html)
display(HTML(html))
3 7 targs idx result
0.602469 0.397531 0 0 0.602469
0.502065 0.497935 1 1 0.497935
0.133188 0.866811 0 2 0.133188
0.996640 0.003360 1 3 0.003360
0.595949 0.404051 1 4 0.404051
0.366118 0.633882 0 5 0.366118
  • The idea here is not to use this on a simple binary problem, where torch.where could do the same job, but to use indexing to solve a multi-category problem.

PyTorch provides a function that does exactly the same thing as sm_acts[range(n), targ] (except it takes the negative, because when applying the log afterward, we will have negative numbers), called nll_loss (NLL stands for negative log likelihood):

-sm_acts[idx, targs]
tensor([-0.6025, -0.4979, -0.1332, -0.0034, -0.4041, -0.3661])
F.nll_loss(sm_acts, targs, reduction='none')
tensor([-0.6025, -0.4979, -0.1332, -0.0034, -0.4041, -0.3661])

Taking the Log

cross-entropy-loss-function.png
  • Using logarithms allows us to work with very large or very small numbers (and to turn multiplications into additions) without worrying about the scale of the result.
    • the nature of log functions makes them increase linearly when the underlying signal increases exponentially
    • log(a*b) = log(a) + log(b)
    • the log of a number approaches negative infinity as the number approaches zero
  • In our case, since the result reflects the predicted probability of the correct label, we want our loss function to return a small value when the prediction is “good” (closer to 1) and a large value when the prediction is “bad” (closer to 0).
  • Notice how the loss is very large in the third and fourth rows, where the predictions are confident and wrong, or in other words have high probabilities on the wrong class. One benefit of using the log to calculate the loss is that our loss function penalizes predictions that are both confident and wrong. This kind of penalty works well in practice to aid in more effective model training.
  • When calculating the loss, we only look at the softmax value assigned to the correct label.
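  • As a quick check (a sketch using the sm_acts, idx, and targs defined above), taking the negative log of the probability picked for each row reproduces the per-row losses we will get from nn.CrossEntropyLoss(reduction='none') below:
# negative log of the probability assigned to the correct class, row by row
-sm_acts[idx, targs].log()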

Negative Log Likelihood

  • After taking the log of the softmax, we take the negative log likelihood.
    • first: log_softmax
    • then: nll_loss
    • or, both combined: nn.CrossEntropyLoss()
loss_func = nn.CrossEntropyLoss()
loss_func(acts, targs)
tensor(1.8045)
nn.CrossEntropyLoss()(acts, targs)
tensor(1.8045)
  • nn.CrossEntropyLoss() does all these steps for us, but if we want to go through them one by one (softmax, then log, then negative log likelihood) we can do that too:
F.nll_loss(nn.Softmax(dim=1)(acts).log(), targs)
tensor(1.8045)
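  • PyTorch also provides F.log_softmax, which combines the softmax and the log in one (more numerically stable) step; a quick equivalence check using the acts and targs from above:
# log_softmax followed by nll_loss is exactly what nn.CrossEntropyLoss computes
F.nll_loss(F.log_softmax(acts, dim=1), targs)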
  • Adding reduction='none' to these functions returns the loss for each row; without this parameter, the function returns the mean loss over all rows.
nn.CrossEntropyLoss(reduction='none')(acts, targs)
tensor([0.5067, 0.6973, 2.0160, 5.6958, 0.9062, 1.0048])

Model Interpretation

  • As we saw in Chapter 3, it’s hard for us to interpret the loss directly, since it is a number the computer uses to update the parameters and optimize performance.
  • But we can use some kind of visualization that shows where the model did well and where it did badly.
interp = ClassificationInterpretation.from_learner(learn)
interp.plot_confusion_matrix(figsize=(12,12), dpi=60)

  • It was easy to understand what happened when there were only 3 classes in the bear model, but here we have 37 breeds.
    • that’s why we use interp.most_confused(min_val=5) to output the model’s worst confusions
interp.most_confused(min_val=5)
[('Ragdoll', 'Birman', 6)]
  • The best way to understand what happened is to google the names of each confused pair of breeds and see why the model mixed them up, so we know whether the model is on the right track.

Improving Our Model

  • At this point, all we can do is improve the model by refining some details that may optimize the final performance.

The Learning Rate Finder

  • One way of improving our model is by picking the right learning rate.
    • it helps get better results per epoch, faster
    • it minimizes the loss, updating the parameters in fewer steps
learn = vision_learner(dls, resnet34, metrics=error_rate)
learn.fine_tune(1, base_lr=0.1)
epoch train_loss valid_loss error_rate time
0 2.615709 5.479418 0.529093 01:11
epoch train_loss valid_loss error_rate time
0 3.068122 1.406392 0.443843 01:15
  • Here we picked a learning rate of 0.1, which is 50 times bigger than the default of 0.002, and we get bad results: an error rate around 0.5.
    • a big learning rate may reduce the number of steps needed for training, but the model’s performance will be bad
  • On the other hand, if we pick a learning rate that is too small, training will take forever to achieve anything.
  • The answer to this dilemma is the learning rate finder.
  • The fastai library adopted this method, created by the researcher Leslie Smith in a 2015 paper.
    • Smith’s idea is to start with a very small learning rate, use it for one mini-batch, see how much the loss changes, and then increase the learning rate by some percentage (doubling it, since it’s very small anyway)
    • repeat this process (track the loss, double the learning rate, ...) until the loss gets worse (a rough sketch of this loop follows this list)
    • at that point, we pick a learning rate somewhat smaller than the one that caused the loss to get worse
  • The fastai course advice is to pick either:
    • one order of magnitude less than where the minimum loss was achieved (i.e. divide it by 10)
    • the last point where the loss was clearly decreasing
  • Both points usually give about the same value.
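  • A rough sketch of that range-test loop (not fastai’s implementation; model, opt, loss_func, and train_dl are hypothetical placeholders):
# LR range test sketch: grow the learning rate every mini-batch and record the loss
lr, best, lrs, losses = 1e-7, float('inf'), [], []
for xb, yb in train_dl:                       # hypothetical training DataLoader
    for g in opt.param_groups: g['lr'] = lr   # set the current learning rate
    loss = loss_func(model(xb), yb)
    opt.zero_grad(); loss.backward(); opt.step()
    lrs.append(lr); losses.append(loss.item())
    best = min(best, loss.item())
    if loss.item() > 4*best: break            # stop once the loss clearly gets worse
    lr *= 2                                   # double the (still tiny) learning rate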
learn = vision_learner(dls, resnet34, metrics=error_rate)
lr_min,lr_steep = learn.lr_find(suggest_funcs=(minimum, steep))

print(f"Minimum/10: {lr_min:.2e}, steepest point: {lr_steep:.2e}")
Minimum/10: 1.00e-02, steepest point: 4.37e-03
  • The plot shows that the loss barely changes between 1e-6 and 1e-3, then it starts to decrease until it reaches its minimum around 1e-1.
  • We don’t want a learning rate bigger than 1e-1, because that is where the loss starts getting worse, and we don’t want it right at 1e-1 either, because at that value we have already left the stage where the loss was decreasing.
    • we want a learning rate in the region where the loss is clearly decreasing toward the minimum, around 1e-3 to 3e-3
learn = vision_learner(dls, resnet34, metrics=error_rate)
learn.fine_tune(2, base_lr=3e-3)
epoch train_loss valid_loss error_rate time
0 1.327572 0.370063 0.120433 01:11
epoch train_loss valid_loss error_rate time
0 0.570779 0.429716 0.131935 01:15
1 0.322137 0.246614 0.073072 01:17
  • The error rate improved roughly sixfold (from about 0.44 to 0.07) just by using the learning rate finder; the losses improved as well.

Unfreezing and Transfer Learning

  • We are familiar with the idea of transfer learning, where we use a pretrained model on our dataset, fine-tuning it in a way that keeps the learned weights and reuses them for our task.

  • We know that a convolutional neural network consists of many linear layers, with a nonlinear activation function (ReLU, for example) between each pair, followed by a final layer with an activation function such as softmax. The final layer uses a matrix with enough columns that the output size matches the number of classes the model was trained to predict (assuming a classification task). This final linear layer is unlikely to be of any use for us when we are fine-tuning in a transfer learning setting, because it is specifically designed to classify the categories in the original pretraining dataset.

  • So we first remove it when we start the transfer learning process, and replace it with a new linear layer with the correct number of outputs for our task (in this case 37 breeds, so 37 activations).

  • This new linear layer has a totally random set of weights, but that doesn’t mean we should set the weights randomly for the pretrained part as well.

    • All of the layers prior to the last one have been carefully trained to be good at image classification tasks in general. As we saw in the images from the Zeiler and Fergus paper, the first few layers encode very general concepts, such as finding gradients and edges, and later layers encode concepts that are still very useful for us, such as finding eyeballs and fur.
  • We want to build a model that preserves all the learned weights and applies them to our dataset, adjusting them only as required for the specifics of our particular task.

  • So the idea is to keep the pretrained part’s weights intact, and only update the weights of the added part. This process is called freezing.

  • When we create a model from a pretrained network fastai automatically freezes all of the pretrained layers for us. When we call the fine_tune method fastai does two things:

    • Trains the randomly added layers for one epoch, with all other layers frozen
    • Unfreezes all of the layers, and trains them all for the number of epochs requested
  • Of course, this is just the default approach; fine_tune has many parameters that allow us to apply different tweaks for each specific situation.

  • For now, let’s do this process manually without using fine_tune

# check the fine_tune source code
learn.fine_tune??
  • First we create our learner from the dls and the architecture using vision_learner.
    • by default, vision_learner will freeze the pretrained part of the model (freeze its parameters)
  • Then we train the added, randomly initialized layers for a number of epochs with a learning rate we pick.
learn = vision_learner(dls, resnet34, metrics=error_rate)
learn.fit_one_cycle(3, 3e-3)
epoch train_loss valid_loss error_rate time
0 1.114612 0.429605 0.134641 01:10
1 0.538649 0.245115 0.083221 01:10
2 0.313570 0.207912 0.065629 01:10
  • Now we need to unfreeze the model:
learn.unfreeze()
  • Now we run lr_find again, because having more layers to train, and weights that have already been trained for three epochs, means our previously found learning rate isn’t appropriate any more:
learn.lr_find()
SuggestedLRs(valley=9.999999747378752e-06)

  • As we can see, the graph is different from what we saw before when the model was trained from random weights, because the model has already been trained.
  • The approach to picking the right lr here is to choose a point well before the sharp increase.
learn.fit_one_cycle(6 , lr_max=4.786300905834651e-06)
epoch train_loss valid_loss error_rate time
0 0.260985 0.207057 0.062246 01:16
1 0.244476 0.201592 0.064276 01:18
2 0.240523 0.193317 0.058187 01:15
3 0.234724 0.189429 0.054127 01:15
4 0.227204 0.188406 0.056157 01:21
5 0.209963 0.187695 0.056157 01:16

Discriminative Learning Rates

  • After training the model for 6 more epochs we get an error_rate of about 6%, which is fine, but we could do better.
  • The thing we could optimize here is, again, the learning rate.
    • picking one learning rate value for the whole neural network isn’t a good idea.
    • the model consists of 2 parts, as we know:
      • the pretrained part, which contains good parameters that have been trained for many epochs
      • the last layers, which we trained ourselves for no more than 9 epochs (3+6)
    • so the idea is that we shouldn’t treat both parts as if they were the same by picking one learning rate for the whole model
    • instead, we can use a small lr value for the first part, and apply a slightly bigger one for the last layers.
  • This technique was developed by Jason Yosinski and his team, who showed in 2014 that with transfer learning, different layers should be trained at different speeds.
att_00039.png
    • Fastai adopts this idea by using slice, a built-in Python object that lets you pass 2 values:
      • the first defines the learning rate of the earliest layers
      • the second, that of the last layers
    • The layers in between will have learning rates that are multiplicatively equidistant throughout that range (illustrated in the sketch below).
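  • A rough illustration of “multiplicatively equidistant” (not fastai’s internal code; assuming three parameter groups, as vision_learner typically creates for a resnet):
import numpy as np
# learning rates spaced evenly on a log scale between the two slice endpoints
np.geomspace(1e-6, 1e-4, num=3)   # array([1.e-06, 1.e-05, 1.e-04])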
  • Let’s see this technique in action
learn = vision_learner(dls, resnet34, metrics=error_rate)
learn.fit_one_cycle(3, 3e-3)
learn.unfreeze()
learn.fit_one_cycle(14, lr_max=slice(1e-6,1e-4))
epoch train_loss valid_loss error_rate time
0 1.129876 0.383506 0.121786 01:10
1 0.516040 0.284697 0.092693 01:13
2 0.328486 0.217860 0.071042 01:11
epoch train_loss valid_loss error_rate time
0 0.269753 0.210582 0.071719 01:15
1 0.256654 0.203205 0.067659 01:17
2 0.245040 0.196284 0.066982 01:14
3 0.222485 0.197652 0.066306 01:14
4 0.186354 0.193144 0.062923 01:17
5 0.185777 0.189425 0.060217 01:15
6 0.150886 0.190105 0.060893 01:15
7 0.146768 0.186121 0.057510 01:18
8 0.134524 0.177772 0.054804 01:15
9 0.135853 0.180999 0.058187 01:15
10 0.127154 0.178239 0.056834 01:18
11 0.110540 0.179652 0.056157 01:15
12 0.122252 0.180609 0.056834 01:14
13 0.105743 0.180926 0.054804 01:17
  • We can plot the training and the validation loss
learn.recorder.plot_loss()

Selecting the Number of Epochs

  • Choosing the right number of epochs to train for is also something we should address properly.
  • We need to keep an eye on the training/validation losses, as shown above, but also on the error rate (or whatever metric we picked).
  • If the loss and the metric are still getting significantly better at the end of training, that means we haven’t trained for too long.
  • The loss is just something we use to give the optimizer something it can differentiate and optimize; it’s not something we really care about in practice.
    • the validation loss can get worse during training because the model becomes overconfident, and only later because it is genuinely overfitting; in practice we only care about the latter issue
    • in case of overfitting, the easy solution is to retrain from scratch, and this time select a total number of epochs based on where your previous best results were found
  • It’s not all about epochs; we could also add more parameters to the model to get better results.

Deeper Architectures

  • In general, a model with more parameters can fit the data more accurately.
  • Using a deeper model is going to require more GPU RAM, so you may need to lower the size of your batches to avoid an out-of-memory error.
    • The way to solve this is to use a smaller batch size, which means passing smaller groups of images at any given time through your model. You can pass the batch size you want to the call creating your DataLoaders with bs= (see the sketch after this list).
  • The other downside of deeper architectures is that they take quite a bit longer to train.
    • One technique that can speed things up a lot is mixed-precision training. This refers to using less-precise numbers (half-precision floating point, also called fp16) where possible during training.
    • To enable this feature in fastai, just add to_fp16() after your Learner creation (you also need to import the module).
  • You can’t really know ahead of time what the best architecture for your particular problem is; you need to try training some. So let’s try a ResNet-50 now with mixed precision:
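  • As a minimal sketch (reusing the pets DataBlock defined earlier; bs=32 is just an example value), lowering the batch size would look like this:
# rebuild the DataLoaders with a smaller batch size so a deeper model fits in GPU memory
dls_small = pets.dataloaders(path/"images", bs=32)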
from fastai.callback.fp16 import *
learn = vision_learner(dls, resnet50, metrics=error_rate).to_fp16()
learn.fine_tune(6, freeze_epochs=3)
Downloading: "https://download.pytorch.org/models/resnet50-0676ba61.pth" to /root/.cache/torch/hub/checkpoints/resnet50-0676ba61.pth
epoch train_loss valid_loss error_rate time
0 1.314968 0.331779 0.112314 01:07
1 0.600175 0.297889 0.089310 01:09
2 0.424932 0.264503 0.078484 01:06
epoch train_loss valid_loss error_rate time
0 0.294915 0.276637 0.076455 01:08
1 0.303050 0.266962 0.077131 01:10
2 0.241821 0.301895 0.086604 01:08
3 0.144625 0.222015 0.060217 01:08
4 0.082382 0.166509 0.056834 01:10
5 0.060593 0.161509 0.060893 01:08
  • We get better results, in fewer epochs and less time per epoch, just by using a deeper architecture.
    • but it’s always better to start with a small model before scaling up