Chapter 4: Deep Learning for Coders with fastai and PyTorch

Fastai
Pytorch
Numpy
Pandas
Deep Learning
Author

Ismail TG

Published

October 26, 2022

 ! [ -e /content ] && pip install -Uqq fastbook
import fastbook
fastbook.setup_book()
Mounted at /content/gdrive
from fastbook import *
from fastai.vision.widgets import *
matplotlib.rc('image', cmap='Greys')

Under the Hood: Training a Digit Classifier

  • In this chapter we will try to understand the mechanics of deep learning by solving a computer vision problem
    • we will build a model from scratch that can classify handwritten digits (3s and 7s)
  • We will also investigate how the model learns, and what the key concepts of the deep learning process are

Pixels: The Foundations of Computer Vision

  • Computers are good with numbers; to make them do computer vision tasks we need to turn images into arrays of numbers
  • We will use a small version of the famous MNIST dataset, which contains only the digits 3 and 7.
  • Our task here is to create a neural network from scratch that can distinguish a 3 from a 7.
#download the dataset:
#MNIST_SAMPLE is a small MNIST subset that contains only 3s and 7s
path = untar_data(URLs.MNIST_SAMPLE)
100.14% [3219456/3214948 00:01<00:00]
#here is where the dataset is stored
path
Path('/root/.fastai/data/mnist_sample')
# we can use ls() to investigate the dataset
# we have 3 directories: train, valid, labels.csv
path.ls()
(#3) [Path('/root/.fastai/data/mnist_sample/labels.csv'),Path('/root/.fastai/data/mnist_sample/train'),Path('/root/.fastai/data/mnist_sample/valid')]
#inside the train/valid folders there are 2 subfolders: 3 and 7
(path/'train').ls()
(#2) [Path('/root/.fastai/data/mnist_sample/train/3'),Path('/root/.fastai/data/mnist_sample/train/7')]
#let's look at what's in those folders: sort the file lists and store them in the variables threes and sevens
threes = (path/'train'/'3').ls().sorted()
sevens = (path/'train'/'7').ls().sorted()
threes[-1], threes[22]
(Path('/root/.fastai/data/mnist_sample/train/3/9991.png'),
 Path('/root/.fastai/data/mnist_sample/train/3/10210.png'))
#let's open one of them
img_path = threes[1]
img = Image.open(img_path)
img

  • In a computer, everything is represented as numbers.
  • To view the numbers that make up this image, we have to convert it to a NumPy array or a PyTorch tensor.
# here we use a NumPy array to represent the image "img" as a matrix of pixel values
# each value lies between 0 and 255
# background pixels are 0 (or close to it); pixels on the digit's stroke have higher values
array(img)[4:10, 4:10]
array([[  0,   0,   0,   0,   0,   0],
       [  0,   0,   0,   0,   0,  29],
       [  0,   0,   0,  48, 166, 224],
       [  0,  93, 244, 249, 253, 187],
       [  0, 107, 253, 253, 230,  48],
       [  0,   3,  20,  20,  15,   0]], dtype=uint8)
#here we use a PyTorch tensor, which is PyTorch's counterpart of a NumPy array
#they share much the same API and behave basically the same (in most cases);
#the main difference is that tensors can be computed on a GPU
a = tensor(img)[4:10, 4:10]
a
tensor([[  0,   0,   0,   0,   0,   0],
        [  0,   0,   0,   0,   0,  29],
        [  0,   0,   0,  48, 166, 224],
        [  0,  93, 244, 249, 253, 187],
        [  0, 107, 253, 253, 230,  48],
        [  0,   3,  20,  20,  15,   0]], dtype=torch.uint8)
  • We can convert an image into an array/tensor, then represent it as a pandas DataFrame and color each cell using background_gradient(): the darker a cell, the closer its value is to the maximum of 255; white cells are 0.
# show the image and its pixel values as a shaded table
# each pixel value varies between 0 and 255
img_t = tensor(img)
df = pd.DataFrame(img_t)
df.style.set_properties(**{'font-size':'4pt'}).background_gradient('Greys')
[Output: the full 28×28 grid of pixel values as a shaded table; cells are white at 0 and shade darker toward 255, so the digit 3 is visible in the numbers themselves.]

Baseline Model

  • It is always good to start with a baseline model, then build something more complex on top of it
  • A baseline model helps us build intuition and understand the shape of the problem
  • It also takes less time to build

Pixel Similarity

  • Calculate the average values for each pixel location across all images for each digit
    • This will generate a blurry image of the target digit
  • Compare the values for each pixel location in a new image to the average
# store all images of 7s and 3s as a list of tensors
sevens_tensors = [tensor(Image.open(o)) for o in sevens]
threes_tensors = [tensor(Image.open(o)) for o in threes]
len(sevens_tensors), len(threes_tensors)
(6265, 6131)
  • Now we have a bunch of tensors; since they're not image objects anymore, we will use fastai's show_image instead of PIL
  • Remember we can always use show_image?? to read the documentation / the source code
# show a image from the tensor list
show_image(threes_tensors[330]);

  • Now we need to compute the average value of each pixel position across all images (of that digit)
  • Put all the images in a list of tensors, then stack them into a single rank-3 tensor.
    • stack images via the PyTorch function torch.stack
    • and scale pixel values from the range [0, 255] to [0, 1]
stacked_threes = torch.stack(threes_tensors).float()/255
stacked_sevens = torch.stack(sevens_tensors).float()/255
stacked_threes.shape
torch.Size([6131, 28, 28])
  • The .shape attribute tells us the length of each axis
    • in this case we have 6131 images, each 28×28 pixels
  • Calculate the mean values for each pixel location across all images
mean3 = stacked_threes.mean(0)
mean7 = stacked_sevens.mean(0)
  • Show our ideal 3 and 7
show_image(mean3)
show_image(mean7)

#pick a single 3, 7 to compare it with the ideal one
a_3 = stacked_threes[55]
a_7 = stacked_sevens[29]
show_image(a_3)
show_image(a_7)

  • Now, say we want to decide whether a_3 is a 3 or a 7
  • To do that we can measure its distance from each of the two ideal images
    • simply summing the differences can't reliably give the right answer, since some differences are negative and would cancel out positive ones

To avoid that we can take 2 approaches:
* Take the mean of the absolute value of the differences; this method is called the L1 norm or mean absolute difference.
* Take the mean of the square of the differences (which makes everything positive) and then take the square root (which undoes the squaring); this method is called the L2 norm or root mean squared error (RMSE).

# Let's try both
dis_3_abs = (a_3-mean3).abs().mean()
dis_3_sqr = ((a_3-mean3)**2).mean().sqrt()
dis_3_abs, dis_3_sqr
(tensor(0.1280), tensor(0.2356))
dist_7_abs = (a_3 - mean7).abs().mean()
dist_7_sqr = ((a_3 - mean7)**2).mean().sqrt()
dist_7_abs,dist_7_sqr
(tensor(0.1832), tensor(0.3356))
  • Based on these numbers, a_3 is closer to the ideal 3 (0.1280 / 0.2356) than to the ideal 7 (0.1832 / 0.3356), which is the right answer

  • PyTorch already provides both of these as loss functions:

    • they live in torch.nn.functional, which is conventionally imported as F (fastai standardizes on this too)
    • l1_loss corresponds to the L1 norm, and mse_loss to the L2 norm
      • mse_loss returns the mean squared error, so it still needs sqrt() to give the RMSE
F.l1_loss(a_3.float(), mean7), F.mse_loss(a_3, mean7).sqrt()
(tensor(0.1832), tensor(0.3356))
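  • If F isn't already in scope (fastbook's star imports bring it in), the conventional import is:
import torch.nn.functional as F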

Computing metrics using Broadcasting

  • To test the accuracy of a model, we should measure it on the validation set
  • so let's build tensors of the 3s and 7s from the validation set directory, then calculate the accuracy of our "model" on every image in the validation set
#create tensors from the images in the validation set, then stack them together
valid_ten_3 = torch.stack([tensor(Image.open(o)) 
                           for o in (path/'valid'/'3').ls()])
#convert to float and divide by 255
valid_ten_3 = valid_ten_3.float()/255
valid_ten_7 = torch.stack([tensor(Image.open(o))
                           for o in (path/'valid'/'7').ls()])
valid_ten_7 = valid_ten_7.float()/255
valid_ten_3.shape,valid_ten_7.shape
(torch.Size([1010, 28, 28]), torch.Size([1028, 28, 28]))
#a function that calculates the mean absolute error;
#mean((-1,-2)) averages over the last two axes (the 28x28 pixel grid)
def min_abser(a, b): return (a-b).abs().mean((-1,-2))
min_abser(valid_ten_3[345], mean3)
tensor(0.1123)
  • But this is only the absolute error for one image. Now we need this distance between the ideal 3/7 and every image in the validation set in order to evaluate our model.
  • The obvious approach is to loop over all images in the validation set and call min_abser() on each one
  • The better way is to use a method called Broadcasting, which virtually expands a tensor so its shape matches another's, letting the whole calculation happen in one go
#example of broadcasting:
tensor([2, 3, 4])+tensor(-2)
tensor([0, 1, 2])
  • PyTorch can execute a calculation between 2 tensors of different ranks (numbers of dimensions); we will take advantage of that here
# shape of mean3/7 and the shape of validation set tensor
mean3.shape, valid_ten_3.shape
(torch.Size([28, 28]), torch.Size([1010, 28, 28]))
  • There are 1010 images, and we want to compare the ideal image mean3 against all of them at once
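  • Thanks to broadcasting, min_abser can do exactly that in one call: subtracting the (28, 28) mean from the (1010, 28, 28) stack virtually expands the smaller tensor, giving one distance per image:
# one mean-absolute distance per image
min_abser(valid_ten_3, mean3).shape   # torch.Size([1010])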

  • is_3() decides whether an image is a 3 or a 7 by checking which output of min_abser() is smaller

def is_3(x): return min_abser(x,mean3) < min_abser(x,mean7)
is_3(a_3), is_3(a_3).float()
(tensor(True), tensor(1.))
  • Note that when we convert the Boolean response to a float, we get 1.0 for True and 0.0 for False. Thanks to broadcasting, we can also test it on the full validation set of 3s:
is_3(valid_ten_3)
tensor([True, True, True,  ..., True, True, True])
  • Now we can calculate the accuracy for each of the 3s and 7s by taking the average of that function for all 3s and its inverse for all 7s:
accuracy_3s =      is_3(valid_ten_3).float() .mean()
accuracy_7s = (1 - is_3(valid_ten_7).float()).mean()

accuracy_3s,accuracy_7s,(accuracy_3s+accuracy_7s)/2
(tensor(0.9168), tensor(0.9854), tensor(0.9511))
  • We're getting over 90% accuracy just from a baseline model!
  • The baseline model is good for understanding the problem and building intuition, but we haven't built a deep learning model yet.
    • we did not add the learning phase to our model, which is a crucial part according to Arthur Samuel's definition
  • In other words, this model cannot be updated and improved by learning
    • we can't improve the pixel-similarity approach by modifying a set of parameters, because we don't have one! That's why we need SGD

Stochastic Gradient Descent

  • Instead of measuring similarity between an image and the "ideal image", we could assign a weight to each pixel position, such that the highest weights go to pixels that are usually dark for that digit and the lowest to pixels that are usually white.

    • for example, the bottom-right pixels are white in most 7s, so we would give them low weights
  • This can be represented as a function and set of weight values for each possible digit:

    def pr_eight(x,w): return (x*w).sum()
    • x is the image we're classifying, represented as a vector
    • w is the vector of weights
  • The idea is to find a method for updating the weights so that the predictions get a little better, then repeat those steps many times until we get the best predictions we can.

    • that is, find the specific values for the vector w that cause the result of our function to be high for those images that are actually 8s, and low for those images that are not.
  • Here are the steps that we are going to require, to turn this function into a machine learning classifier:

      1. Initialize the weights.
      2. For each image, use these weights to predict whether it appears to be a 3 or a 7.
      3. Based on these predictions, calculate how good the model is (its loss).
      4. Calculate the gradient, which measures, for each weight, how changing that weight would change the loss.
      5. Step (that is, change) all the weights based on that calculation.
      6. Go back to step 2 and repeat the process.
      7. Iterate until you decide to stop the training process (for instance, because the model is good enough or you don't want to wait any longer); a minimal sketch of this loop follows.
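  • Here is a sketch of that loop; init, predict, loss_fn and gradient are placeholder names for the steps above, not functions defined in this chapter:
# pseudocode for the seven steps
weights = init()                      # 1. initialize
for step in range(n_steps):           # 6./7. repeat until we decide to stop
    preds = predict(x, weights)       # 2. predict
    loss = loss_fn(preds, targets)    # 3. measure the loss
    grad = gradient(loss, weights)    # 4. gradient of the loss w.r.t. the weights
    weights -= grad * lr              # 5. step the weights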

Applying SGD on a Simple Case

  • Before applying these steps to our image classification problem, let’s illustrate what they look like in a simpler case.
  • First we will define a very simple function f, the quadratic—let’s pretend that this is our loss function, and x is a weight parameter of the function:
def f(x): return x**2
plot_function(f, 'x', 'x**2')

Let's pick a random value for the parameter and calculate the loss:

plot_function(f, 'x', 'x**2')
plt.scatter(-1.5, f(-1.5), color='red');

  • now we will increase the parameter value by a bit (0.5) and see what happens:
plot_function(f,'x','x**2')
plt.scatter(-1.5, f(-1.5), color='red' );
plt.scatter(-1,f(-1), color='blue');

  • It seems our loss got better (remember, the whole objective of this process is to drive the loss toward 0), so let's keep increasing the parameter, this time by 0.3 and then by another 0.2:
plot_function(f,'x','x**2')
plt.scatter(-1.5, f(-1.5), color='red' );
plt.scatter(-1,f(-1), color='blue');
plt.scatter(-0.7,f(-0.7), color='green');
plt.scatter(-0.5, f(-0.5), color='orange');

  • The main idea here is to adjust the parameters in a way that minimizes the loss

  • We can change our weight by a little in the direction of the slope, calculate our loss, adjust again, and repeat this a few times. Eventually, we will get to the lowest point on our curve:


Calculating Gradients

  • It's obvious now that in order to optimize the loss function we need to update the weights.
    • to do that we need calculus: gradients tell us in which direction to change each weight to improve the loss

Calculating Derivatives in PyTorch

#define the function and the variable to differentiate at
def f(x): return x**2
xt = tensor(3.).requires_grad_()
  • here we create a tensor with value 3, then call the method requires_grad_, which tells PyTorch to track operations on this tensor so gradients with respect to it can be computed at that value.
# apply the function f(x)=x**2 and store the result as yt
yt = f(xt)
yt
tensor(9., grad_fn=<PowBackward0>)
# now we calculate the gradients using the backward() 
yt.backward()
#now we can view the gradients by checking the `grad` attribute of the tensor:
xt.grad
tensor(6.)
#now we will repeat the same steps but now with a vector:
xt = tensor([3.,4.,10.]).requires_grad_()
def f(x): return (x**2).sum()
#let's check our function
f(xt)
tensor(125., grad_fn=<SumBackward0>)
yt = f(xt)
#calculate the gradients
yt.backward()
#as expected, the gradient is 2*xt
xt.grad
tensor([ 6.,  8., 20.])
#let's create another example
x = tensor(3.).requires_grad_()
w = tensor(4.).requires_grad_()
b = tensor(5.).requires_grad_()
y = x * w + b
y
tensor(17., grad_fn=<AddBackward0>)
y.backward()
print('d(y)/dx= ', x.grad)
print('d(y)/dw= ', w.grad)
print('d(y)/db= ', b.grad)
d(y)/dx=  tensor(4.)
d(y)/dw=  tensor(3.)
d(y)/db=  tensor(1.)
  • Now we have all the ingredients to apply what we have learned to a real problem.
    • the goal is to apply the 7 steps we saw in order to optimize the weights, which affects the loss, and in turn the accuracy of the model
  • One last thing before we do that: let's talk about the learning rate
  • Gradient descent lets us correct our weights by taking steps toward their optimal values, but it doesn't tell us how big or small those steps should be; that's why we have to choose a learning rate.
  • The learning rate is the factor by which we scale the gradient when stepping to new weights.
  • In practice the learning rate is often a small value chosen between 0.1 and 0.001
  • Once we have picked a learning rate, we adjust the parameters using this simple step:
w -= gradient(w) * lr
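  • As a small worked example of one such step (our own, not from the book): for f(x) = x**2 at x = -1.5 the gradient is 2*x = -3.0, so with lr = 0.1 the parameter moves from -1.5 to -1.2, toward the minimum:
x = tensor(-1.5).requires_grad_()
loss = x**2
loss.backward()
x.data -= 0.1 * x.grad    # x.grad is -3.0, so x becomes -1.2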

An End-to-End SGD Example

  • Suppose we want to predict the speed of a roller coaster as it goes over a hump.
  • We want to build a model of how the speed changes over time
    • say we measure the speed manually once per second for 20 seconds:
time = torch.arange(20).float();
time
tensor([ 0.,  1.,  2.,  3.,  4.,  5.,  6.,  7.,  8.,  9., 10., 11., 12., 13., 14., 15., 16., 17., 18., 19.])
  • It might look something like this:
# we add some noise to the data, since this is how we usually find data in the real world
speed = torch.randn(20)*3 + 0.75*(time-9.5)**2 + 1
plt.scatter(time,speed);

  • The goal here is to find a function that matches our observations, using SGD
  • We choose a quadratic function here.
    • we need to distinguish clearly between the function's input t (the time of our measurements) and its parameters params
#assume the relationship is quadratic

def f(t, params):
    a, b, c= params
    return a*(t**2)+(b*t)+c
  • Then we need to define a loss function that tells us how good or bad our predictions are for a given choice of the parameters a, b, c (here, the RMSE):
def mse(preds, targets): return((preds - targets)**2).mean().sqrt()

Step 1: Initialize the parameters

  • first we initialize the parameters to random values, telling PyTorch that we want to track their gradients
params= torch.randn(3).requires_grad_()
params
tensor([-0.7658, -0.7506,  1.3525], requires_grad=True)
# keep a copy of the original parameters, so we can reset to them later if needed
orig_params = params.clone()

Step 2: Calculate the predictions

  • Next we calculate the prediction
preds = f(time, params)
preds
tensor([ 1.3525e+00, -1.6391e-01, -3.2121e+00, -7.7919e+00, -1.3903e+01, -2.1547e+01, -3.0721e+01, -4.1428e+01, -5.3666e+01, -6.7436e+01, -8.2738e+01, -9.9571e+01, -1.1794e+02, -1.3783e+02,
        -1.5926e+02, -1.8222e+02, -2.0671e+02, -2.3274e+02, -2.6029e+02, -2.8938e+02], grad_fn=<AddBackward0>)

The preds come from the function we created earlier, f(t, params), where a, b, c = params and the result is a*(t**2) + (b*t) + c; t == time, and params holds the randomly initialized values from the previous step.

# plot the actual values and our predictions
def show_preds(preds, ax=None):
    if ax is None: ax=plt.subplots()[1]
    ax.scatter(time, speed)
    ax.scatter(time, to_np(preds), color='red')
    ax.set_ylim(-300,100)
show_preds(preds)

  • The blue dots represent the actual values
  • The red dots represent what our model predicts
    • the predictions are really bad, but remember that they are based on random parameter values
  • So our job here is to update those values to get better predictions

Step 3: Calculate the loss

  • Now we calculate the loss of our predictions
  • Remember:
def mse(preds, targets): return((preds - targets)**2).mean().sqrt()
loss= mse(preds, speed)
loss
tensor(160.6979, grad_fn=<SqrtBackward0>)
  • Our goal is to minimize this loss, bringing the predictions as close as possible to the targets (speed)

Step 4: Calculate the gradients

  • In order to minimize the loss function, we use the gradients, which tell us approximately how the parameters should be changed to achieve our goal
loss.backward()
params.grad
tensor([-165.5151,  -10.6402,   -0.7900])
params
tensor([-0.7658, -0.7506,  1.3525], requires_grad=True)

Step 5: Step the weights

  • We need to update the weights (parameters) using the gradients we just calculated

    • first we pick a learning rate
    • then we assign through the .data attribute of the parameters and use parameters.grad.data; this tells PyTorch not to track the gradient of the update step itself.
lr = 1e-5
params.data -= lr * params.grad.data
params.grad = None
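  • As a side note, an equivalent way to perform the same untracked update is a torch.no_grad() block:
# same update without touching .data
with torch.no_grad():
    params -= lr * params.grad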
  • Let’s see if the loss has improved:
preds = f(time,params)
mse(preds, speed)
tensor(160.4228, grad_fn=<SqrtBackward0>)
  • And what about the plot?
show_preds(preds)

  • Now let’s execute all steps in one function
def apply_step(params, prn=True):
    preds = f(time, params)
    loss = mse(preds, speed)
    loss.backward()
    params.data -= lr * params.grad.data
    params.grad = None
    if prn: print(loss.item())
    return preds
  • Let's apply this step 1200 times:
for i in range(1200): apply_step(params)
160.42279052734375
160.14772033691406
159.87269592285156
[... 1200 loss values, steadily decreasing ...]
26.017789840698242
26.01778793334961
  • After 1200 iterations we have reduced the loss from about 160 to about 26.
  • Now let’s plot the process
_,axs = plt.subplots(1,4,figsize=(12,3))
for ax in axs: show_preds(apply_step(params, False), ax)
plt.tight_layout()

  • After each iteration we get an updated quadratic function that is a little closer to the one underlying the real data

Step 7: Stop

  • After some number of iterations, we decide to stop

Summarizing Gradient Descent

  • At the beginning we start with the weights
    • pick them randomly if we build the model from scratch
    • or take them pretrained from another model: transfer learning
  • Build a loss function
    • it lets us see how good or bad the outputs of the model are
    • then we try to change/update the weights in a way that makes the loss lower
  • To find out how to update the weights so the loss improves, we use calculus to calculate the gradients
    • the gradients tell us whether to decrease or increase each weight in order to minimize the loss (that simple!)
  • We then iterate until we reach the lowest point
  • Stop

The MNIST Loss Function

  • We saw previously the function pr_eight(), where we represent the input image x as a vector, just like the weight vector w.
    • here we will do the same with our MNIST sample
  • We already have our dataset of 3s and 7s as tensors, i.e. the x of the function:
    • training set: stacked_threes, stacked_sevens
    • validation set: valid_ten_3, valid_ten_7
  • We'll concatenate them all into a single tensor, and also change them from a list of matrices (a rank-3 tensor) to a list of vectors (a rank-2 tensor)
    • by using .view
    • a PyTorch method that changes the shape of a tensor without changing its contents
    • -1 is a special parameter to .view that means: make this axis as big as necessary to fit the data
train_x = torch.cat([stacked_threes, stacked_sevens]).view(-1, 28*28)
train_x.shape
torch.Size([12396, 784])
  • We need a label for each image. We’ll use 1 for 3s and 0 for 7s:
train_y = tensor([1]*len(threes) + [0]*len(sevens))
train_x.shape,train_y.shape
(torch.Size([12396, 784]), torch.Size([12396]))
  • One problem: train_x has shape [12396, 784] but train_y has shape [12396]; we want each label to be a rank-1 tensor of its own so the (x, y) pairs line up.
  • So we apply the unsqueeze() method to train_y, which inserts a unit axis and returns a rank-2 tensor:
train_y= train_y.unsqueeze(1)
train_y.shape
torch.Size([12396, 1])
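  • For example, unsqueeze(1) turns a length-3 vector into a 3×1 column:
t = tensor([1, 0, 1])
t.shape, t.unsqueeze(1).shape   # (torch.Size([3]), torch.Size([3, 1]))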
  • Here we create the training dataset dset by zipping the independent and dependent variables together.
dset = list(zip(train_x,train_y))
x,y = dset[0]
x.shape,y
(torch.Size([784]), tensor([1]))
  • Go through the same steps for the validation set
valid_x = torch.cat([valid_ten_3, valid_ten_7]).view(-1, 28*28)
valid_y = tensor([1]*len(valid_ten_3) + [0]*len(valid_ten_7)).unsqueeze(1)
valid_dset = list(zip(valid_x,valid_y))

Now we go through the 7 steps we saw earlier, but this time for the MNIST model.

1-Initialize the parameters

def init_params(size, std=1.0): return (torch.randn(size)*std).requires_grad_()
  • Initialize weights
weights = init_params((28*28, 1))
weights.shape
torch.Size([784, 1])
  • Initialize the bias
bias = init_params(1)
bias
tensor([0.6863], requires_grad=True)

2-Prediction calculation

  • Now we can calculate a prediction from these random weights and bias, for one image (train_x[0])

  • To multiply 2 matrices, their inner dimensions must agree:

    • number of columns of the first matrix == number of rows of the second matrix
(train_x[0]*weights.T).sum()+bias
tensor([20.2336], grad_fn=<AddBackward0>)
  • In Python, matrix multiplication is written @
  • Here we create a function that multiplies our training set by the weights (xb@weights) and then adds the bias
def linear1(xb): return xb@weights + bias
preds = linear1(train_x)
preds
tensor([[20.2336],
        [17.0644],
        [15.2384],
        ...,
        [18.3804],
        [23.8567],
        [28.6816]], grad_fn=<AddBackward0>)
  • Check the accuracy of our model based on the random weights
# Let's check our accuracy
corrects = (preds>0.0).float()==train_y
corrects
tensor([[ True],
        [ True],
        [ True],
        ...,
        [False],
        [False],
        [False]])
# check the mean
corrects.float().mean().item()
0.4912068545818329
  • Now let’s see what the change in accuracy is for a small change in one of the weights (note that we have to ask PyTorch not to calculate gradients as we do this, which is what with torch.no_grad() is doing here):
with torch.no_grad(): weights[0] *= 1.0001
preds = linear1(train_x)
((preds>0.0).float() == train_y).float().mean().item()
0.4912068545818329
  • Notice that even after the change we committed to one of the weights, we observe no change at all in the model's accuracy! That's a problem for our method: we need a quantity that reflects the changes we make to the parameters, so we can update the parameters in a way that makes our predictions better.

  • The problem with the threshold preds>0.0 is that its gradients are almost always equal to zero: gradients are "rise over run", and a small change in the parameters rarely flips a prediction from 3 to 7 or vice versa, so the measured change is always 0.

  • Instead we need a loss function which, when our weights produce slightly better predictions, gives us a slightly better loss

trgts  = tensor([1,0,1])
prds   = tensor([0.9, 0.4, 0.2])
  • Here trgts are the labels of three images: 1 for 3s and 0 for 7s
  • prds are the model's predictions: it is confident the first image is a 3 (0.9), unsure about the second (0.4), and confidently wrong about the third (0.2)
  • The objective is to build a loss function that measures the quality of these predictions and changes smoothly whenever we update our parameters
def mnist_loss(predictions, targets):
    return torch.where(targets==1, 1-predictions, predictions).mean()
  • torch.where is a way of applying a condition elementwise; in effect it says:

for each item: if target == 1, take 1 - prediction; otherwise take the prediction

torch.where(trgts==1, 1-prds, prds).mean()
tensor(0.4333)
  • Let's check that the loss improves when we improve the predictions.
    • we pass the tensor [0.9, 0.4, 0.8] to see how changing the third prediction from 0.2 to 0.8 affects the loss
mnist_loss(tensor([0.9, 0.4, 0.8]), trgts)
tensor(0.2333)
  • As we can see, the loss gets better (lower) when we shrink the distance between the prediction and the target for the third image
  • The problem we still face with this method is that it assumes the predictions are always between 0 and 1
  • To solve this we need a function that maps any prediction, no matter how far above 1 or below 0, into the interval between those two numbers
  • In fact there is a function that does exactly that: the sigmoid function

Sigmoid Function

The sigmoid function always outputs a number between 0 and 1. It’s defined as follows:

def sigmoid(x): return 1/(1+torch.exp(-x))
plot_function(torch.sigmoid, title='Sigmoid', min=-4, max=4)

  • Whatever input we give sigmoid, it always returns a number between 0 and 1
  • Now let's update mnist_loss so it applies sigmoid to its inputs first
def mnist_loss(predictions, targets):
    predictions = predictions.sigmoid()
    return torch.where(targets==1, 1-predictions, predictions).mean()
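  • A quick sanity check (example values of our own, not from the book): activations far above 0 where the target is 1, and far below 0 where the target is 0, should give a small loss; the reverse should give a large one:
mnist_loss(tensor([ 3., -3.]), tensor([1, 0]))   # close to 0: confidently right
mnist_loss(tensor([-3.,  3.]), tensor([1, 0]))   # close to 1: confidently wrong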

SGD and Mini-Batches

  • Now that we have a loss function for SGD, we need to decide how often to update the gradients: we could update after seeing all the data points, or after every single data point.
  • The first option takes a long time per update, and the second uses very little information per update, so we take another track: use mini-batches and update the gradients after each one
  • The number of data points in each batch is called the batch size: a larger batch size gives a more accurate and stable gradient estimate but takes longer per update, while a smaller batch size gives faster but noisier updates
  • We will see later how to decide a suitable batch size for each situation
  • We will use DataLoader to shuffle the data items before creating the mini-batches, so the items in each mini-batch vary from epoch to epoch
#for example
c = range(15)
dl = DataLoader(c, batch_size=5, shuffle=True)
list(dl)
[tensor([ 3, 12,  8, 10,  2]),
 tensor([ 9,  4,  7, 14,  5]),
 tensor([ 1, 13,  0,  6, 11])]
  • But for training a model we don't want a plain Python collection of numbers; we want a collection containing independent and dependent variables.
    • a collection that returns tuples of independent and dependent variables is called a Dataset in PyTorch
  • Here is a simple dataset:
ds = L(enumerate(string.ascii_lowercase))
ds
(#26) [(0, 'a'),(1, 'b'),(2, 'c'),(3, 'd'),(4, 'e'),(5, 'f'),(6, 'g'),(7, 'h'),(8, 'i'),(9, 'j')...]
  • When we pass a Dataset to a DataLoader we will get back mini-batches which are themselves tuples of tensors representing batches of independent and dependent variables:
dl = DataLoader(ds, batch_size=6, shuffle=True)
list(dl)
[(tensor([17, 18, 10, 22,  8, 14]), ('r', 's', 'k', 'w', 'i', 'o')),
 (tensor([20, 15,  9, 13, 21, 12]), ('u', 'p', 'j', 'n', 'v', 'm')),
 (tensor([ 7, 25,  6,  5, 11, 23]), ('h', 'z', 'g', 'f', 'l', 'x')),
 (tensor([ 1,  3,  0, 24, 19, 16]), ('b', 'd', 'a', 'y', 't', 'q')),
 (tensor([2, 4]), ('c', 'e'))]

Putting It All Together

Now we have:
* dset and valid_dset
* linear1
* mnist_loss (using sigmoid)

so we can train the model from scratch.

First we re-initialize our parameters:

weights = init_params((28*28,1))
bias = init_params(1)
  • Then create the DataLoader from Dataset
dl = DataLoader(dset, batch_size=256)
xb,yb = first(dl)
xb.shape,yb.shape
(torch.Size([256, 784]), torch.Size([256, 1]))
  • We’ll do the same for the validation set:
valid_dl = DataLoader(valid_dset, batch_size=256)
  • Let’s create a mini-batch of size 4 for testing:
batch = train_x[:4]
batch.shape
torch.Size([4, 784])
  • Call the linear1 model we created earlier on the batch:
preds = linear1(batch)
preds
tensor([[-2.1876],
        [-8.3973],
        [ 2.5000],
        [-4.9473]], grad_fn=<AddBackward0>)
  • Create the loss function:
loss = mnist_loss(preds, train_y[:4])
loss
tensor(0.7419, grad_fn=<MeanBackward0>)
  • Now we can calculate the gradients:
loss.backward()
weights.grad.shape,weights.grad.mean(),bias.grad
(torch.Size([784, 1]), tensor(-0.0061), tensor([-0.0420]))
  • Let's put all these steps in a single function:
def calc_grad(xb, yb, model):
    preds = model(xb)
    loss = mnist_loss(preds, yb)
    loss.backward()
  • And test it..
calc_grad(batch, train_y[:4], linear1)
weights.grad.mean(), bias.grad
(tensor(-0.0121), tensor([-0.0840]))
  • But look what happens if we call it twice:
calc_grad(batch, train_y[:4], linear1)
weights.grad.mean(),bias.grad
(tensor(-0.0182), tensor([-0.1260]))
  • The gradients have changed!
    • loss.backward() adds the new gradients to whatever is already stored, so calling calc_grad repeatedly accumulates them; to avoid this we use grad.zero_(), which sets the current gradients to zero
weights.grad.zero_()
bias.grad.zero_();
  • Our only remaining step is to update the weights and biases based on the gradient and learning rate. When we do so, we have to tell PyTorch not to take the gradient of this step too—otherwise things will get very confusing when we try to compute the derivative at the next batch! If we assign to the data attribute of a tensor then PyTorch will not take the gradient of that step. Here’s our basic training loop for an epoch:
def train_epoch(model, lr, params):
    for xb,yb in dl:
        calc_grad(xb, yb, model)
        for p in params:
            p.data -= p.grad*lr
            p.grad.zero_()
  • Let's build a function that calculates the batch accuracy
# accuracy of the batch
(preds>0.0).float() == train_y[:4]
tensor([[False],
        [False],
        [ True],
        [False]])
  • Function to calculate our validation accuracy:
def batch_accuracy(xb, yb):
    preds = xb.sigmoid()
    correct = (preds>0.5) == yb
    return correct.float().mean()
# check if it works
batch_accuracy(linear1(batch), train_y[:4])
tensor(0.2500)
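  • Note that batch_accuracy applies the sigmoid first, so its 0.5 threshold is equivalent to the 0.0 threshold we used on the raw predictions above, since sigmoid(0) = 0.5. A quick sketch to confirm, reusing preds from our test batch:

torch.sigmoid(torch.tensor(0.0))           # tensor(0.5000)
(preds > 0.0) == (preds.sigmoid() > 0.5)   # all True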
# all together
def validate_epoch(model):
    accs = [batch_accuracy(model(xb), yb) for xb,yb in valid_dl]
    return round(torch.stack(accs).mean().item(), 4)
validate_epoch(linear1)
0.5262
  • Now let’s train for one epoch and see if the accuracy improves:
lr = 1.
params = weights,bias
train_epoch(linear1, lr, params)
validate_epoch(linear1)
0.6663
  • That’s promising!
    • from 0.5262 to 0.6663 after one epoch
  • Then do a few more:
for i in range(20):
    train_epoch(linear1, lr, params)
    print(validate_epoch(linear1), end=' ')
0.8265 0.89 0.9183 0.9276 0.9398 0.9467 0.9506 0.9525 0.9559 0.9579 0.9599 0.9608 0.9613 0.9618 0.9633 0.9638 0.9647 0.9657 0.9672 0.9677 

Creating an Optimizer

  • As we know, the model we created from scratch was only for learning purposes.
    • in real-world scenarios we do not need to implement everything from scratch; frameworks like PyTorch and fastai provide it all.
  • The linear1 model we created can be replaced with nn.Linear, which does the same work and more:
    • nn.Linear combines the roles of linear1 and init_params in a single class: it holds both the weights and the bias.
linear_model = nn.Linear(28*28, 1)
  • Every PyTorch module knows what parameters it has that can be trained; they are available through the parameters method:
w, b = linear_model.parameters()
w.shape, b.shape
(torch.Size([1, 784]), torch.Size([1]))
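  • As a quick sanity check (a sketch reusing batch and the w, b we just unpacked): nn.Linear stores its weights transposed, so its forward pass computes xb @ w.T + b, which matches our hand-rolled linear1:

torch.allclose(batch @ w.T + b, linear_model(batch))  # True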
  • We can also create our own optimizer class:
class BasicOptim:
    def __init__(self,params,lr): self.params,self.lr = list(params),lr

    def step(self, *args, **kwargs):
        for p in self.params: p.data -= p.grad.data * self.lr

    def zero_grad(self, *args, **kwargs):
        for p in self.params: p.grad = None
# passing the model params to the optimizer
opt = BasicOptim(linear_model.parameters(), lr)
# simplifying the epoch training function
def train_epoch(model):
    for xb,yb in dl:
        calc_grad(xb, yb, model)
        opt.step()
        opt.zero_grad()
# validation doesn't change
validate_epoch(linear_model)
0.4606
# simplifying the training loop:
def train_model(model, epochs):
    for i in range(epochs):
        train_epoch(model)
        print(validate_epoch(model), end=' ')
# same results as before
train_model(linear_model, 20)
0.4932 0.7686 0.8555 0.9136 0.9346 0.9482 0.957 0.9634 0.9658 0.9678 0.9697 0.9717 0.9736 0.9746 0.9761 0.9771 0.9775 0.9775 0.978 0.9785 
  • The good thing is that all of this is already provided by fastai: its SGD class does, by default, the same thing as our BasicOptim:
linear_model = nn.Linear(28*28,1)
opt = SGD(linear_model.parameters(), lr)
train_model(linear_model, 20)
0.4932 0.8179 0.8496 0.9141 0.9346 0.9482 0.957 0.9619 0.9658 0.9673 0.9692 0.9712 0.9741 0.9751 0.9761 0.9775 0.9775 0.978 0.9785 0.979 
  • Fastai also provides Learner.fit, which we can use instead of train_model. To create a Learner we first need to create a DataLoaders, by passing in our training and validation DataLoaders:
dls = DataLoaders(dl, valid_dl)
  • To create a Learner we need to pass in all the elements that we’ve created in this chapter: the DataLoaders, the model, the optimization function (which will be passed the parameters), the loss function, and optionally any metrics to print:
learn = Learner(dls, nn.Linear(28*28,1), opt_func=SGD,
                loss_func=mnist_loss, metrics=batch_accuracy)

Now we can call fit:

learn.fit(10, lr=lr)
epoch train_loss valid_loss batch_accuracy time
0 0.636709 0.503144 0.495584 00:00
1 0.429828 0.248517 0.777233 00:00
2 0.161680 0.155361 0.861629 00:00
3 0.072948 0.097721 0.917566 00:00
4 0.040128 0.073205 0.936212 00:00
5 0.027210 0.059466 0.950442 00:00
6 0.021837 0.050799 0.957802 00:00
7 0.019398 0.044980 0.964181 00:00
8 0.018122 0.040853 0.966143 00:00
9 0.017330 0.037788 0.968106 00:00

Adding a Nonlinearity

  • So far we have managed to create a linear function that can classify handwritten digits with high accuracy.
  • But still, it’s just a simple linear classifier, with very constrained abilities.
  • To make it more capable of handling complex tasks, we need to add something non-linear between two linear classifiers.
    • This gives us a neural network.
  • Here is a basic architecture of a neural network:
def simple_net(xb): 
    res = xb@w1 + b1           # first linear layer
    res = res.max(tensor(0.0)) # non-linear function (ReLU)
    res = res@w2 + b2          # second linear layer
    return res
w1 = init_params((28*28,30))
b1 = init_params(30)
w2 = init_params((30,1))
b2 = init_params(1)
w1.shape, b1.shape
(torch.Size([784, 30]), torch.Size([30]))
  • The lines res = xb@w1 + b1 and res = res@w2 + b2 are two linear classifiers, basic linear functions similar to the nn.Linear we used previously.
  • The line res = res.max(tensor(0.0)) is a non-linear function:
    • this function is called a rectified linear unit (ReLU); it replaces every negative number with 0 and leaves positive numbers unchanged (see the sketch after this list).
  • So if we think about the architecture we have here:
    • first, a linear function multiplies the input tensors by the initialized weights, adds the bias, and outputs 30 features (we could choose any number), each representing a different mix of pixels
    • these outputs pass through the ReLU, which converts them to 0 if they are negative and leaves them unchanged otherwise, then feeds the 30 features to the next linear layer, which does the same computation as the first and outputs the result
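  • A minimal sketch of what ReLU does to a tensor; F.relu is PyTorch’s built-in version of the same max(0, x) operation:

import torch
import torch.nn.functional as F

t = torch.tensor([-2.0, -0.5, 0.0, 1.5])
t.max(torch.tensor(0.0))  # tensor([0.0000, 0.0000, 0.0000, 1.5000])
F.relu(t)                 # same result: negatives clamped to zero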
  • In PyTorch there’s a module that fits our neural net here: nn.Sequential().
    • it passes the result of each layer on to the next
  • We also replace the linear functions we built with nn.Linear, and max(tensor(0.0)) with nn.ReLU():
simple_net = nn.Sequential(
    nn.Linear(28*28,30),
    nn.ReLU(),
    nn.Linear(30,1)
)
# let's train the model again, now with two linear layers
learn = Learner(dls, simple_net, opt_func=SGD,
                loss_func=mnist_loss, metrics=batch_accuracy)
learn.fit(40, 0.1)
epoch train_loss valid_loss batch_accuracy time
0 0.333021 0.396112 0.512267 00:00
1 0.152461 0.235238 0.797350 00:00
2 0.083573 0.117471 0.911678 00:00
3 0.054309 0.078720 0.940628 00:00
4 0.040829 0.061228 0.956330 00:00
5 0.034006 0.051490 0.963690 00:00
6 0.030123 0.045381 0.966634 00:00
7 0.027619 0.041218 0.968106 00:00
8 0.025825 0.038200 0.969087 00:00
9 0.024441 0.035901 0.969578 00:00
10 0.023321 0.034082 0.971541 00:00
11 0.022387 0.032598 0.972031 00:00
12 0.021592 0.031353 0.974485 00:00
13 0.020904 0.030284 0.975466 00:00
14 0.020300 0.029352 0.975466 00:00
15 0.019766 0.028526 0.975466 00:00
16 0.019288 0.027788 0.976448 00:00
17 0.018857 0.027124 0.977429 00:00
18 0.018465 0.026523 0.978410 00:00
19 0.018107 0.025977 0.978901 00:00
20 0.017777 0.025479 0.978901 00:00
21 0.017473 0.025022 0.979392 00:00
22 0.017191 0.024601 0.980373 00:00
23 0.016927 0.024213 0.980373 00:00
24 0.016680 0.023855 0.981354 00:00
25 0.016449 0.023521 0.981354 00:00
26 0.016230 0.023211 0.981354 00:00
27 0.016023 0.022922 0.981354 00:00
28 0.015827 0.022653 0.981845 00:00
29 0.015641 0.022401 0.981845 00:00
30 0.015463 0.022165 0.981845 00:00
31 0.015294 0.021944 0.983317 00:00
32 0.015132 0.021736 0.982826 00:00
33 0.014977 0.021541 0.982826 00:00
34 0.014828 0.021357 0.982336 00:00
35 0.014686 0.021184 0.982336 00:00
36 0.014549 0.021019 0.982336 00:00
37 0.014417 0.020864 0.982336 00:00
38 0.014290 0.020716 0.982336 00:00
39 0.014168 0.020576 0.982336 00:00
  • At this point we have done everything we can to improve the model’s performance, and we got an accuracy of about 0.982, which is a very solid number.
  • From here on, all we can do is look inside the model and try to understand the mechanics of each step.
# we can see the architecture of the model
m = learn.model
  • As expected, the model architecture contains two linear layers with a non-linear function in between.
m
Sequential(
  (0): Linear(in_features=784, out_features=30, bias=True)
  (1): ReLU()
  (2): Linear(in_features=30, out_features=1, bias=True)
)
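  • As a side note, we can count the trainable parameters of this architecture (a quick check: 784·30 + 30 + 30·1 + 1 = 23,581):

sum(p.numel() for p in m.parameters())  # 23581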
  • We can also see what the model learned in each layer:
w, b = m[0].parameters()
show_image(w[22].view(28,28))
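  • Each row of w is a 784-pixel weight vector, so we can view several of the 30 learned feature detectors at once (a sketch, assuming matplotlib’s plt is available via the fastbook import):

fig, axs = plt.subplots(1, 6, figsize=(12, 2))
for i, ax in enumerate(axs):
    show_image(w[i].view(28, 28), ax=ax)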