Chapter 7: Deep Learning for Coders with fastai and PyTorch

Fastai
PyTorch
Deep Learning
Author

Ismail TG

Published

October 30, 2022

Training a State-of-the-Art Model

  • In this chapter we will discuss more advanced techniques for training image classification models and getting better results.
  • We will cover:
    • Normalization
    • Data augmentation with Mixup
    • Progressive resizing
    • Test time augmentation
  • To implement all these techniques, we will build a model from scratch and train it on a subset of ImageNet called Imagenette.

Imagenette

  • This dataset was created by the fastai community. The goal is a dataset on which trained models generalize well to the larger version (ImageNet), helping machine learning practitioners build and experiment with many ideas and projects using less computing power.
  • Imagenette has 10 classes, which are very different from one another.
  • Let’s download the dataset
! [ -e /content ] && pip install -Uqq fastbook
import fastbook
fastbook.setup_book()
from fastbook import *
from fastai.vision.all import *
path = untar_data(URLs.IMAGENETTE)
  • Here we create the DataLoaders using the presizing technique we saw in Chapter 5:
dblock = DataBlock(blocks=(ImageBlock(), CategoryBlock()),
                   get_items=get_image_files,
                   get_y=parent_label,
                   item_tfms=Resize(460),
                   batch_tfms=aug_transforms(size=224, min_scale=0.75))
dls = dblock.dataloaders(path, bs=64)
  • Now we train a baseline model, just to have something to compare against:
model = xresnet50(n_out=dls.c)
learn = Learner(dls, model, loss_func=CrossEntropyLossFlat(), metrics=accuracy)
learn.fit_one_cycle(5, 3e-3)
epoch train_loss valid_loss accuracy time
0 1.615762 1.960073 0.446229 02:59
1 1.231709 1.436732 0.542943 03:01
2 0.946307 0.957584 0.703883 03:01
3 0.710163 0.686722 0.788648 02:59
4 0.581876 0.551411 0.833831 02:57
  • Not a bad result, considering that we didn't use a pretrained model.
  • The aim of this chapter is to improve the performance of this baseline model by applying different techniques.

Normalization

  • Normalization means transforming the data so that it has a mean of 0 and a standard deviation of 1.
  • Most image and computer vision libraries use pixel values between 0 and 255, or between 0 and 1; in either case, your data will not have a mean of 0 and a standard deviation of 1.
  • Let's grab a batch of data and take the mean and standard deviation over every axis except the channel axis (axis 1):
x,y = dls.one_batch()
x.mean(dim=[0,2,3]),x.std(dim=[0,2,3])
(TensorImage([0.4573, 0.4509, 0.4188], device='cuda:0'),
 TensorImage([0.2730, 0.2652, 0.2803], device='cuda:0'))
  • As expected, the mean and standard deviation are not the values we want.
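  • For reference, normalizing by hand is just subtracting the per-channel mean and dividing by the per-channel standard deviation; a minimal sketch, reusing the batch x from above:
# Normalize by hand: broadcast the per-channel stats over the batch and spatial axes
mean = x.mean(dim=[0,2,3], keepdim=True)
std = x.std(dim=[0,2,3], keepdim=True)
x_norm = (x - mean) / std
x_norm.mean(dim=[0,2,3]), x_norm.std(dim=[0,2,3])  # now approximately 0 and 1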
  • fastai provides the Normalize transform, which is applied to a whole batch at once, so we can add it to the batch_tfms section of the DataBlock. We just need to tell fastai the mean and standard deviation to use (here, the ImageNet stats); if we don't provide any, fastai will calculate them from a single batch and use those.
  • Notice that here we build the DataBlock inside a function that takes the batch size and image size as parameters; we will see why later, in the progressive resizing section.
def get_dls(bs, size):
    dblock = DataBlock(blocks=(ImageBlock, CategoryBlock),
                   get_items=get_image_files,
                   get_y=parent_label,
                   item_tfms=Resize(460),
                   batch_tfms=[*aug_transforms(size=size, min_scale=0.75),
                               Normalize.from_stats(*imagenet_stats)])
    return dblock.dataloaders(path, bs=bs)
dls = get_dls(64, 224)
x,y = dls.one_batch()
x.mean(dim=[0,2,3]),x.std(dim=[0,2,3])
(TensorImage([-0.0920, -0.0490,  0.0548], device='cuda:0'),
 TensorImage([1.1969, 1.1888, 1.2855], device='cuda:0'))
  • The Normalize transform gets us close to the desired values for the mean and standard deviation.
  • Let's see how this affects the performance of the model:
model = xresnet50(n_out=dls.c)
learn = Learner(dls, model, loss_func=CrossEntropyLossFlat(), metrics=accuracy)
learn.fit_one_cycle(5, 3e-3)
epoch train_loss valid_loss accuracy time
0 1.617465 1.945767 0.462285 02:58
1 1.272744 2.252786 0.462659 02:58
2 0.947452 1.084139 0.643017 02:58
3 0.722699 0.795408 0.765123 03:00
4 0.610676 0.577560 0.819642 02:58
  • Normalization didn't improve our results here; this technique matters most when using a pretrained model, as illustrated below.
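  • A pretrained model's weights were computed from data normalized with specific statistics, so new inputs must be normalized the same way; cnn_learner handles this automatically by adding the matching Normalize transform. A minimal sketch:
# With pretrained weights, the inputs must match the statistics used during
# pretraining; cnn_learner adds the proper Normalize transform automatically.
learn = cnn_learner(dls, resnet50, metrics=accuracy)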

Progressive Resizing

  • The idea behind progressive resizing is to start training the model with small images and finish training with large images.

  • As we have seen, nothing about the architecture ties it to one particular image size, so we are free to change the size during training.

  • However, changing the image size in the middle of training will affect how the model behaves, one way or another.

  • The best approach is to treat the transition from small images to larger ones the same way we treat transfer learning: after changing the image size, we call the fine_tune method.

  • We can also look at progressive resizing as a form of data augmentation, which means we should expect the model to generalize better.

  • To implement progressive resizing it is most convenient if you first create a get_dls function which takes an image size and a batch size, as we did in the previous section, and returns your DataLoaders:

  • Now you can create your DataLoaders with a small size and use fit_one_cycle in the usual way, training for a few less epochs than you might otherwise do:

dls = get_dls(128, 128)
learn = Learner(dls, xresnet50(n_out=dls.c), loss_func=CrossEntropyLossFlat(), 
                metrics=accuracy)
learn.fit_one_cycle(4, 3e-3)
  • Then you can replace the DataLoaders inside the Learner, and fine-tune:
learn.dls = get_dls(64, 224)
learn.fine_tune(5, 1e-3)
  • We can keep increasing the image size bit by bit, fine-tuning after each increase, until we reach the size of the actual images in the dataset (a multi-step schedule is sketched after this list).
  • Progressive resizing can help or hurt, depending on many factors:
    • For transfer learning, if the pretrained model was trained on a similar task and on similarly sized images, retraining it on smaller images will usually damage its performance.
    • On the other hand, if the pretrained model was trained on a different task and on differently sized images, progressive resizing may help.
  • There is no right answer for every situation; we just need to try things and experiment.
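  • A minimal sketch of such a schedule (the sizes, batch sizes, and epoch counts below are illustrative choices, not prescriptions):
# Start small, then step up the image size, fine-tuning after each change
dls = get_dls(128, 128)
learn = Learner(dls, xresnet50(n_out=dls.c),
                loss_func=CrossEntropyLossFlat(), metrics=accuracy)
learn.fit_one_cycle(4, 3e-3)

for size in (160, 224):            # illustrative size schedule
    learn.dls = get_dls(64, size)  # swap in DataLoaders with bigger images
    learn.fine_tune(5, 1e-3)       # treat the size change like transfer learning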

Test Time Augmentation

  • Until now, all the data augmentation we applied was done on the training set; the validation set always gets the same images.

  • The idea behind test time augmentation (tta) is to create several augmented versions of each validation image, get a prediction for each, and then average the predictions (or take their maximum).

  • By default, fastai applies center cropping for validation.

  • Center cropping can cause the model to miss valuable information at the edges of the image.

  • Instead of center cropping, we could select a number of areas to crop from the original rectangular image, pass each of them through the model, and take the average or the maximum of the predictions.

  • By default, fastai will use the unaugmented center crop image plus four randomly augmented images.

  • You can pass any DataLoader to fastai’s tta method; by default, it will use your validation set:

preds,targs = learn.tta()
accuracy(preds, targs).item()
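  • tta also accepts a few optional arguments (assuming the current fastai signature): n controls how many augmented versions are used, and use_max takes the maximum of the predictions instead of the average:
# More augmented copies per image, taking the max over predictions
preds,targs = learn.tta(n=8, use_max=True)
accuracy(preds, targs).item()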

Mixup

  • The idea of Mixup is to take two data points (images) and blend them together in some proportion, then build the one-hot-encoded targets so that each label gets that proportion instead of 0 or 1. The model then has to predict not only the right labels, but also the proportion by which each one is present in the image.
  • We can express this idea with code as:
# Load one example image from each of two Imagenette classes
church = PILImage.create(get_image_files_sorted(path/'train'/'n03028079')[0])
gas = PILImage.create(get_image_files_sorted(path/'train'/'n03425413')[0])
church = church.resize((256,256))
gas = gas.resize((256,256))
# Convert to float tensors scaled to [0, 1]
tchurch = tensor(church).float() / 255.
tgas = tensor(gas).float() / 255.

_,axs = plt.subplots(1, 3, figsize=(12,4))
show_image(tchurch, ax=axs[0]);
show_image(tgas, ax=axs[1]);
show_image((0.3*tchurch + 0.7*tgas), ax=axs[2]);  # the mix: 30% church + 70% gas station
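  • The code above only mixes the pixels; Mixup also mixes the targets with the same weight. A rough sketch of that part (the class indices are hypothetical, and the real MixUp callback draws the coefficient from a Beta distribution rather than fixing it at 0.3):
import torch.nn.functional as F

t = 0.3                            # mixing coefficient, fixed here for illustration
y_church, y_gas = 2, 4             # hypothetical class indices
y1 = F.one_hot(torch.tensor(y_church), num_classes=10).float()
y2 = F.one_hot(torch.tensor(y_gas), num_classes=10).float()
mixed_x = t*tchurch + (1-t)*tgas   # mix the images...
mixed_y = t*y1 + (1-t)*y2          # ...and the targets: 0.3 for church, 0.7 for gas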
model = xresnet50(n_out=dls.c)
learn = Learner(dls, model, loss_func=CrossEntropyLossFlat(), 
                metrics=accuracy, cbs=MixUp())
learn.fit_one_cycle(5, 3e-3)
  • Training with Mixup is harder for the model, because it now has to predict two labels per image instead of one, as well as the proportion of each.
  • Overfitting seems less likely to be a problem when Mixup is used.
  • Mixup tends to need much more training to give good results: it shines with 80 or more epochs.
  • Another benefit of Mixup appears when labels are exactly 0 or 1: because of softmax and sigmoid, the outputs can never be exactly 0 or 1, so the loss against hard labels can never be perfect. With Mixup the labels are only 0 or 1 when we happen to mix two images of the same class; the rest of the time they are linear combinations such as 0.4 and 0.6.
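  • A quick illustration of the "never exactly 0 or 1" point (the numbers are arbitrary): even with a very large activation for one class, softmax only approaches 1:
logits = tensor([[10., 0., 0.]])
torch.softmax(logits, dim=1)  # roughly [[0.9999, 0.0000, 0.0000]] -- never exactly 1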

Label Smoothing

  • In classification problems the labels are usually exactly 0 or 1, and the model's job is to return values as close to them as possible; even 0.999 is not good enough when the label is 1! This pushes the model to keep updating its weights toward that single right answer, which can lead to overfitting.
  • The solution is to smooth the labels: replace each 1 with a slightly smaller number and each 0 with a slightly bigger number. This encourages the model to be less confident, which helps it generalize better.
  • Label smoothing can be expressed mathematically:
    • 0 becomes \(\frac{\epsilon}{N}\), where \(N\) is the number of classes and \(\epsilon\) is a parameter, usually 0.1 (it's like saying we are 10% unsure of the labels).

    • 1 becomes \(1-\epsilon + \frac{\epsilon}{N}\).

  • In our Imagenette example, where we have 10 classes, the targets become something like this (here for a target corresponding to index 3):
[0.01, 0.01, 0.01, 0.91, 0.01, 0.01, 0.01, 0.01, 0.01, 0.01]
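  • A quick check of those numbers in code (N=10 classes, \(\epsilon=0.1\), true class at index 3):
N, eps, target = 10, 0.1, 3
smoothed = torch.full((N,), eps/N)  # every class gets eps/N = 0.01
smoothed[target] = 1 - eps + eps/N  # the true class gets 0.91
smoothed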
  • To use this in practice, we just have to change the loss function in our call to Learner:

model = xresnet50(n_out=dls.c)
learn = Learner(dls, model, loss_func=LabelSmoothingCrossEntropy(), metrics=accuracy)
learn.fit_one_cycle(5, 3e-3)
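  • For intuition, here is a rough sketch of what a label-smoothing cross entropy computes, following the formulas above (fastai's LabelSmoothingCrossEntropy is what you'd actually use; this is just to show the idea):
import torch.nn.functional as F

def label_smoothing_ce(output, target, eps=0.1):
    # Cross entropy against smoothed targets: every class gets weight eps/N,
    # and the true class gets 1 - eps + eps/N.
    log_preds = F.log_softmax(output, dim=-1)
    uniform = -log_preds.mean(dim=-1)                       # eps/N part (mean == sum over classes / N)
    nll = F.nll_loss(log_preds, target, reduction='none')   # true-class part
    return (eps*uniform + (1-eps)*nll).mean()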