contructingnablac — part 2 of 2

Source: public/assets/writing/mathematics/contructingnablac.docx · mathematics
However, we need to compute the full gradient: we also need the derivatives with respect to all the other weights and biases in the network.

Let’s find out how to compute the derivative with respect to the bias in our output layer next.

THE BIAS CALC

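A nice and easy exercise at this point: luckily, the sensitivity of the cost function to a tweak in the bias is almost exactly the same as our equation for a change to the weight:

$$\frac{\partial C_0}{\partial b^{(L)}} = \frac{\partial z^{(L)}}{\partial b^{(L)}} \frac{\partial a^{(L)}}{\partial z^{(L)}} \frac{\partial C_0}{\partial a^{(L)}}$$

All we do is replace the $\partial z^{(L)} / \partial w^{(L)}$ with a $\partial z^{(L)} / \partial b^{(L)}$. Even more luckily, this new derivative is just 1: in $z^{(L)} = w^{(L)} a^{(L-1)} + b^{(L)}$ we treat the other terms as constants and differentiate $b^{(L)}$ with respect to itself (a beautifully easy differentiation, just like differentiating x + n with respect to x, resulting in 1):

$$\frac{\partial z^{(L)}}{\partial b^{(L)}} = 1$$

So we get our second gradient entry even more simply than we did the derivative for the last weight.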
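As a rough sanity check, here is a minimal Python sketch of the two output-layer derivatives. It assumes a sigmoid activation and a squared-difference cost for a single output neuron; the function names and the numbers are made up for illustration. The finite-difference line is only a numerical way of confirming that the bias derivative behaves as described.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def sigmoid_prime(z):
    s = sigmoid(z)
    return s * (1.0 - s)

def cost(w, b, a_prev, y):
    """Squared difference between the output activation and the desired output."""
    return (sigmoid(w * a_prev + b) - y) ** 2

def output_layer_grads(w, b, a_prev, y):
    """Return (dC/dw, dC/db) for the output layer via the chain rule."""
    z = w * a_prev + b
    a = sigmoid(z)
    dC_da = 2.0 * (a - y)      # cost with respect to the output activation
    da_dz = sigmoid_prime(z)   # activation with respect to the weighted sum
    dz_dw = a_prev             # weighted sum with respect to the weight
    dz_db = 1.0                # weighted sum with respect to the bias
    return dz_dw * da_dz * dC_da, dz_db * da_dz * dC_da

# Hypothetical values, chosen only for illustration.
w, b, a_prev, y = 0.7, -0.3, 0.9, 1.0
dC_dw, dC_db = output_layer_grads(w, b, a_prev, y)

# Numerical check: nudge the bias a little and measure the change in cost.
eps = 1e-6
numeric_db = (cost(w, b + eps, a_prev, y) - cost(w, b - eps, a_prev, y)) / (2 * eps)
print(dC_dw, dC_db, numeric_db)
```

The only difference between the two returned values is the first factor: the previous activation for the weight, and 1 for the bias.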

THE PREVIOUS W+B CALC

Now that we know how changes to the weight and bias in the output layer of our super simple neural network change the overall cost, we already have two entries of our gradient vector. But we still need the derivatives with respect to the other weights and biases that come before them.

All the other weights and biases lie earlier on in our neural network, which means they have less direct influence on the cost. The way we deal with them is by figuring out how sensitive the cost is to the neuron in the second to last layer, $a^{(L-1)}$, and then finding out how sensitive that one is to all the preceding weights and biases.

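Probably unsurprisingly, that looks very similar to what we’ve already seen:

$$\frac{\partial C_0}{\partial a^{(L-1)}} = \frac{\partial z^{(L)}}{\partial a^{(L-1)}} \frac{\partial a^{(L)}}{\partial z^{(L)}} \frac{\partial C_0}{\partial a^{(L)}}$$

To solve this modified version we just need to solve $\partial z^{(L)} / \partial a^{(L-1)}$, which is easily enough done the same way we found $\partial z^{(L)} / \partial w^{(L)}$. The bias is a constant, so it drops out, and differentiating $w^{(L)} a^{(L-1)}$ with respect to $a^{(L-1)}$ leaves just the weight:

$$\frac{\partial z^{(L)}}{\partial a^{(L-1)}} = w^{(L)}$$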

Which just tells us that when the activation in the second to last layer is changed, the effect it has on $z^{(L)}$ will be proportional to the weight $w^{(L)}$ that it’s connected by. That definitely makes sense, albeit tautologically, considering all a weight does is tell us how much effect the activation should have.
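A tiny Python sketch can make this concrete. It nudges the previous activation and measures how much the weighted sum moves; the values are hypothetical and only there for illustration.

```python
def weighted_sum(w, a_prev, b):
    """z = w * a_prev + b for a single connection."""
    return w * a_prev + b

# Hypothetical values, for illustration only.
w, a_prev, b = 1.5, 0.4, 0.2
eps = 1e-6

# Nudge the previous activation and see how much z moves per unit of nudge.
dz_da_prev = (weighted_sum(w, a_prev + eps, b)
              - weighted_sum(w, a_prev - eps, b)) / (2 * eps)

# A different bias shifts z itself, but not its sensitivity to a_prev.
dz_da_prev_other_bias = (weighted_sum(w, a_prev + eps, 5.0)
                         - weighted_sum(w, a_prev - eps, 5.0)) / (2 * eps)

print(dz_da_prev, dz_da_prev_other_bias)  # both should be very close to w
```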

But we don’t really care how changing $a^{(L-1)}$ affects $z^{(L)}$, because we can’t change the activation directly. What we should care about is the weights and biases that make it up. Remember, the activation in the previous layer is made up of its own weight and bias and, in turn, another activation:

$$a^{(L-1)} = \sigma(z^{(L-1)}), \qquad z^{(L-1)} = w^{(L-1)} a^{(L-2)} + b^{(L-1)}$$

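This is where propagating backwards comes in handy. Though we can’t directly change the activation, we can keep on iterating this use of the chain rule backwards to find how sensitive the cost function is to all of the different weights and biases. For example, we can compute the long chain of steps between $w^{(L-1)}$ and $C_0$ simply by breaking it all down into bite-sized intermediate steps using our chain rule propagated backwards:

$$\frac{\partial C_0}{\partial w^{(L-1)}} = \frac{\partial z^{(L-1)}}{\partial w^{(L-1)}} \frac{\partial a^{(L-1)}}{\partial z^{(L-1)}} \frac{\partial z^{(L)}}{\partial a^{(L-1)}} \frac{\partial a^{(L)}}{\partial z^{(L)}} \frac{\partial C_0}{\partial a^{(L)}}$$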

The way one might read this is by tracing their way back through the tree I showed. When we ask how sensitive $C_0$ is to a change in $w^{(L-1)}$, we answer by saying: let’s see how sensitive $z^{(L-1)}$ is to tweaks in $w^{(L-1)}$, then how sensitive $a^{(L-1)}$ is to $z^{(L-1)}$, and so on, until we reach the point where we ask how sensitive $C_0$ is to $a^{(L)}$.

By keeping track of the constituent parts throughout the tree and multiplying our list of partial derivatives, we now find ourselves with a way to calculate the derivative of the cost with respect to any weight or bias in our entire neural network. In fact, all we are doing is applying the same simple chain rule use case that we’ve been utilising the whole time! Using this backpropagation technique, we can now calculate the entire gradient vector. That’s a wrap! Well… at least for this super simple version of a neural network.
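To see the whole backwards walk in one place, here is a minimal Python sketch for a chain with one neuron per layer. It assumes sigmoid activations and a squared-difference cost; the function names, the three-layer shape and the starting numbers are all hypothetical. It is one possible way of organising the chain rule steps, not the only one.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def forward(weights, biases, x):
    """Return the weighted sums and the activations (activations[0] is the input)."""
    activations, zs = [x], []
    for w, b in zip(weights, biases):
        zs.append(w * activations[-1] + b)
        activations.append(sigmoid(zs[-1]))
    return zs, activations

def cost(weights, biases, x, y):
    _, activations = forward(weights, biases, x)
    return (activations[-1] - y) ** 2

def backprop(weights, biases, x, y):
    """Return (dC/dw, dC/db), one entry per layer, for the one-neuron chain."""
    zs, activations = forward(weights, biases, x)
    grad_w = [0.0] * len(weights)
    grad_b = [0.0] * len(biases)
    dC_da = 2.0 * (activations[-1] - y)         # start at the output activation
    for layer in reversed(range(len(weights))):
        s = sigmoid(zs[layer])
        dC_dz = dC_da * s * (1.0 - s)           # step back through the sigmoid
        grad_w[layer] = dC_dz * activations[layer]  # dz/dw is the incoming activation
        grad_b[layer] = dC_dz                   # dz/db is 1
        dC_da = dC_dz * weights[layer]          # dz/da_prev is the weight
    return grad_w, grad_b

# Hypothetical three-layer chain, for illustration only.
weights = [0.5, -1.2, 0.8]
biases = [0.1, 0.3, -0.4]
x, y = 0.6, 1.0
grad_w, grad_b = backprop(weights, biases, x, y)
print(grad_w, grad_b)
```

Each pass through the loop reuses the sensitivity computed for the layer after it, which is why the long chain of partial derivatives never has to be written out in full.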

MORE COMPLEX NETWORKS

The good news, which you might find unbelievable, is that it doesn’t get much more complex for a real neural network. This single-neuron-per-layer network taught us all the calculations and math we need for larger-scale networks like our snack classification task. Nothing really changes except a couple more indices that we add on to keep track of everything.

Now, rather than the activation in a layer being just $a^{(L)}$, we’ll also give it a subscript to indicate which neuron of the layer it is (as our layers now have multiple neurons, like in the full neural network I showed before). The first in the output layer, for example, would be $a_0^{(L)}$, the second would be $a_1^{(L)}$, and the third $a_2^{(L)}$. Likewise, in the previous layer they would be $a_0^{(L-1)}$, $a_1^{(L-1)}$, and so on.

Just one final push in terms of calculus and remembering notation, I promise. Let’s use the letter k to index the $(L-1)$ layer and the letter j to index the $(L)$ layer.

To find the cost, we still need to sum (add up) the squares of the differences between the last layer activations and our desired outputs. More formally, this means: sum $(a_j^{(L)} - y_j)^2$ over every output neuron j, or even more formally:

$$C_0 = \sum_{j} \left(a_j^{(L)} - y_j\right)^2$$

Each weight also needs a couple more indices to keep track of which one it is, so let’s say $w_{jk}^{(L)}$ means the weight of the edge that connects the kth neuron in the $(L-1)$ layer to the jth neuron in the $(L)$ layer.

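Now we can still call our relevant weighted sum something like $z_j^{(L)}$:

$$z_j^{(L)} = w_{j0}^{(L)} a_0^{(L-1)} + w_{j1}^{(L)} a_1^{(L-1)} + w_{j2}^{(L)} a_2^{(L-1)} + b_j^{(L)}$$

(this example shows only three neurons in the second to last layer, but this can be done with any number of neurons)

And we still wrap this $z_j^{(L)}$ up with a function like σ to get our final activation, $a_j^{(L)} = \sigma(z_j^{(L)})$.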
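The indexed weighted sum, the activation and the cost can be sketched in a few lines of Python. This assumes a sigmoid for σ and stores the weights as a list of rows, so that `W[j][k]` plays the role of the weight from neuron k in the previous layer to neuron j in this one. The layer sizes and numbers are hypothetical.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def layer_forward(W, b, a_prev):
    """Return (weighted sums, activations) for one layer.

    W[j][k] connects neuron k of the previous layer to neuron j of this layer.
    """
    zs = [sum(W[j][k] * a_prev[k] for k in range(len(a_prev))) + b[j]
          for j in range(len(b))]
    return zs, [sigmoid(z) for z in zs]

def cost(output_activations, desired):
    """Sum of squared differences over the output neurons."""
    return sum((a - y) ** 2 for a, y in zip(output_activations, desired))

# Hypothetical layer: three neurons feeding into two output neurons.
W = [[0.2, -0.5, 0.1],
     [0.7, 0.3, -0.4]]
b = [0.0, 0.1]
a_prev = [1.0, 0.5, 0.2]
y = [1.0, 0.0]

zs, activations = layer_forward(W, b, a_prev)
total_cost = cost(activations, y)
print(zs, activations, total_cost)
```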

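These equations boil down to the same thing we had in our single-neuron-per-layer model. For example, the chain rule derivative of how sensitive the cost is to a weight in the last connection layer is essentially the same:

$$\frac{\partial C_0}{\partial w_{jk}^{(L)}} = \frac{\partial z_j^{(L)}}{\partial w_{jk}^{(L)}} \frac{\partial a_j^{(L)}}{\partial z_j^{(L)}} \frac{\partial C_0}{\partial a_j^{(L)}}$$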

The only difference here is that we keep track of a couple more indices which help tell us which weight we’re talking about out of all the different weights connecting the second to last layer with the output layer.
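Here is a minimal Python sketch of that last-layer weight derivative with the extra indices in place. It assumes sigmoid activations and the sum-of-squares cost; the helper names and numbers are hypothetical.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def cost(W, b, a_prev, y):
    """Sum-of-squares cost of the output layer for one training example."""
    total = 0.0
    for j in range(len(b)):
        z_j = sum(W[j][k] * a_prev[k] for k in range(len(a_prev))) + b[j]
        total += (sigmoid(z_j) - y[j]) ** 2
    return total

def last_layer_weight_grad(W, b, a_prev, y, j, k):
    """dC/dw_jk: only output neuron j lies on the path from this weight to the cost."""
    z_j = sum(W[j][i] * a_prev[i] for i in range(len(a_prev))) + b[j]
    a_j = sigmoid(z_j)
    dz_dw = a_prev[k]            # weighted sum with respect to w_jk
    da_dz = a_j * (1.0 - a_j)    # sigmoid derivative at z_j
    dC_da = 2.0 * (a_j - y[j])   # cost with respect to a_j
    return dz_dw * da_dz * dC_da

# Hypothetical values, for illustration only.
W = [[0.2, -0.5, 0.1],
     [0.7, 0.3, -0.4]]
b = [0.0, 0.1]
a_prev = [1.0, 0.5, 0.2]
y = [1.0, 0.0]

grad_w_10 = last_layer_weight_grad(W, b, a_prev, y, j=1, k=0)
print(grad_w_10)
```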

The only thing that we have to pay attention to is the derivative of the cost with respect to an activation in a previous layer, such as an activation in the layer $(L-1)$:

$$\frac{\partial C_0}{\partial a_k^{(L-1)}}$$

We should keep in mind here that this activation influences the cost function through multiple paths. Changing this activation will not only impact the first output neuron, but also the second and the third: in fact, it impacts every output neuron it is connected to, and each of them affects the cost function in turn. This means that in calculating the impact of tweaking an activation in the second to last layer on the cost function, we need to consider its influence on all of our final output neurons.

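The way to deal with calculating this cost shift over multiple pathways is simple. We write one chain rule expression for each pathway through which the activation impacts the cost function, and add them up over every output neuron j to find the cost function’s sensitivity:

$$\frac{\partial C_0}{\partial a_k^{(L-1)}} = \sum_{j} \frac{\partial z_j^{(L)}}{\partial a_k^{(L-1)}} \frac{\partial a_j^{(L)}}{\partial z_j^{(L)}} \frac{\partial C_0}{\partial a_j^{(L)}}$$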

Once we know how sensitive the cost function is to activations in this second to last layer, we can repeat the same process for all the weights and biases feeding into that layer. Once we can do that, we know everything necessary mathematically to construct our very own image classifier.
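The sum over pathways can be sketched in Python as follows. As before, this assumes sigmoid activations and the sum-of-squares cost, with `W[j][k]` standing for the weight from neuron k in the second to last layer to output neuron j; names and numbers are hypothetical.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def cost(W, b, a_prev, y):
    """Sum-of-squares cost of the output layer for one training example."""
    total = 0.0
    for j in range(len(b)):
        z_j = sum(W[j][k] * a_prev[k] for k in range(len(a_prev))) + b[j]
        total += (sigmoid(z_j) - y[j]) ** 2
    return total

def activation_sensitivity(W, b, a_prev, y, k):
    """dC/da_k for the second to last layer: one chain rule term per output neuron."""
    total = 0.0
    for j in range(len(b)):
        z_j = sum(W[j][i] * a_prev[i] for i in range(len(a_prev))) + b[j]
        a_j = sigmoid(z_j)
        dz_da_prev = W[j][k]         # z_j with respect to a_k
        da_dz = a_j * (1.0 - a_j)    # sigmoid derivative at z_j
        dC_da = 2.0 * (a_j - y[j])   # cost with respect to a_j
        total += dz_da_prev * da_dz * dC_da
    return total

# Hypothetical values, for illustration only.
W = [[0.2, -0.5, 0.1],
     [0.7, 0.3, -0.4]]
b = [0.0, 0.1]
a_prev = [1.0, 0.5, 0.2]
y = [1.0, 0.0]

sensitivities = [activation_sensitivity(W, b, a_prev, y, k)
                 for k in range(len(a_prev))]
print(sensitivities)
```

These sensitivities are what gets passed back one layer: they take the place that the derivative of the cost with respect to the output activation had in the last layer.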

CLASSIFIER

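From the backpropagation calculus we learnt, we can then calculate our gradient:

$$\nabla C = \begin{bmatrix} \frac{\partial C}{\partial w^{(1)}} \\ \frac{\partial C}{\partial b^{(1)}} \\ \vdots \\ \frac{\partial C}{\partial w^{(L)}} \\ \frac{\partial C}{\partial b^{(L)}} \end{bmatrix}$$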

With that, we can concretely perform the abstract idea of gradient descent we discussed previously, using the negative gradient of the cost function. All I need to do now to classify snacks is feed a ton of training images into this math, and with enough iterations of gradient descent I should eventually get a decent image classifier (and the same principle applies to anything).
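To make the loop concrete, here is a minimal Python sketch of gradient descent on the super simple one-neuron-per-layer chain, using a single made-up training example rather than a real set of images. The learning rate, number of steps and starting values are all hypothetical, and a real classifier would average the gradient over many training examples.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def cost_and_gradient(weights, biases, x, y):
    """Cost and gradient for a chain with one sigmoid neuron per layer."""
    activations, zs = [x], []
    for w, b in zip(weights, biases):
        zs.append(w * activations[-1] + b)
        activations.append(sigmoid(zs[-1]))
    cost = (activations[-1] - y) ** 2
    grad_w, grad_b = [0.0] * len(weights), [0.0] * len(biases)
    dC_da = 2.0 * (activations[-1] - y)
    for layer in reversed(range(len(weights))):
        a = activations[layer + 1]
        dC_dz = dC_da * a * (1.0 - a)               # back through the sigmoid
        grad_w[layer] = dC_dz * activations[layer]  # back to the weight
        grad_b[layer] = dC_dz                       # back to the bias
        dC_da = dC_dz * weights[layer]              # back to the previous activation
    return cost, grad_w, grad_b

def gradient_descent(weights, biases, x, y, learning_rate=0.1, steps=1000):
    """Repeatedly step every weight and bias along the negative gradient."""
    weights, biases = list(weights), list(biases)
    costs = []
    for _ in range(steps):
        cost, grad_w, grad_b = cost_and_gradient(weights, biases, x, y)
        costs.append(cost)
        weights = [w - learning_rate * g for w, g in zip(weights, grad_w)]
        biases = [b - learning_rate * g for b, g in zip(biases, grad_b)]
    return weights, biases, costs

# Hypothetical starting point and training example, for illustration only.
trained_w, trained_b, costs = gradient_descent([0.5, -1.2], [0.1, 0.3], x=0.6, y=1.0)
print(costs[0], costs[-1])  # the cost at the start and after the last step
```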