contructingnablac — part 1 of 2

Source: public/assets/writing/mathematics/contructingnablac.docx · mathematics

General Introduction: In this post I will attempt to explain the basics of the mathematics behind neural-network-based machine learning comprehensively, while only using math commensurate to a high-school level.

This doesn’t go over the relevant coding libraries, and is not a coding tutorial, though it is more useful than one for building the relevant mathematical and computational (under the hood) understanding. Such tutorials are commonplace and produced at a high level. The point of this series is to provide low level concrete mathematical understanding. I have written it with some level of the software aspect in mind though, so it will be easy for someone who already understands the coding side of it to grasp, or otherwise grasp the software side after reading this.

My main goal here is to provide an easy way for those not acquainted with higher-level math to understand the “concrete” mathematical basis for neural networks, instead of having to rely on higher level abstractions.

A secondary goal is to demonstrate how relatively simple math, if applied with care and clever iteration, can create wondrously useful things, such as a system that (seems to) "think" to complete tasks much like a human brain does.

Section 1: What are Neural Networks anyway? And what do we want to use them for?

Introduction:

The example:

I want to set up a little food recognition camera to find out whether the afternoon food in the kitchen of the boarding house where I stay is chocolate, crisps, or pizza, to determine whether it’s worth the 30 second walk. Yes, a very pressing task, but the same thing is done with images of roads: self-driving cars anyone? The same thing can also be done with word pairings, which is how LLMs like ChatGPT work, albeit with a little more architecture called transformers. Nonetheless the main point for now is that this is a basic image classification task where I want my math to differentiate (pun intended) between pizza, chocolate or crisps. The hardware part of this is not going to be discussed, but the mathematics underlying any software that might be used will be the focus of this series. So the goal is to feed in an image of the Johnson’s snack and then get our mathematical model to “tell” us what type of snack it is. In this series, we will look at how a basic perceptron is built and how it learns: specifically, neural networks and backpropagation calculus.

Recognising images might seem easy at first, especially for us who are amazing at doing stuff like that, but writing the code for a model that can do that is actually incredibly difficult. Try and describe how you would write the code for a machine to take in pixels and recognise if it was pizza or not with high accuracy. The task suddenly seems almost impossible, especially with high school mathematics, as computers can’t just do abstract reasoning in their code… or can they?

If we write code that mimics our brain and how that works, perhaps we can tackle the fuzzy and abstract reasoning problems we are so good at taking on. That’s the rationale behind neural networks, which aren’t told to follow a set of rules like normal code, but instead “learn” from a number of examples we give them, a process called training.

Firstly though, before we can get to training our neural network, we need to know what neural networks are, and how one looks and works as a mathematical model. There are many types of neural networks, such as transformers and convolutional neural networks, but the easiest way to understand and learn is by constructing a basic model with no unnecessary features.

Constructing a simple perceptron:

When I say neuron, I just need you to imagine it’s a little cell that contains a number between 0 and 1. A neural network is actually just a bunch of these little cells (and their numbers) clumped together. The little number inside each neuron is called the activation of the neuron and represents how strongly that neuron “fires.”

Every bit of information in our project will actually be represented as these little numbers between 0 and 1 in our neurons, from our inputted picture to the output that classifies the image. First of all, how do we represent a picture mathematically? Well, each picture I take on my phone is 4000 pixels high and 3000 pixels wide (a pixel is a little square that holds a colour made of red, green and blue; put together with many others, it forms a recognisable image). So in total we have 4000 x 3000 pixels, which is a whopping 12 million pixels. Each pixel has a value between 0 and 1 for its red value, its green value and its blue value, so each pixel actually carries three values, meaning we have to multiply that number by 3. That makes 36 million different little numbers between 0 and 1.

Here's a useful way to think about it: we slice up the image into a bunch of tiny blocks and state each block’s colour as a combination of Red, Green and Blue (RGB). A square (part of the mozzarella on a pizza, for example) might contain 100% red, 100% green and 100% blue, which would be a value of 1 for each and result in white, or it might contain some other combination, which can make any colour. We can take this little bundle of colour data and represent it mathematically with 3 numbers between 0 and 1. Do this for every single one of the 12 million pixels and we have a mathematical representation of an image.
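To make the pixel-to-number idea concrete, here is a minimal Python sketch. It assumes each colour channel arrives as a whole number from 0 to 255 (a common convention, not something stated above) and divides by 255 to land between 0 and 1. The tiny 2x2 "image" and the name `image_to_inputs` are made up for illustration.

```python
# A tiny 2x2 "image": each pixel is (red, green, blue), each channel 0-255.
# The 0-255 scale and these pixel values are assumptions for illustration.
image = [
    [(255, 255, 255), (200, 30, 30)],
    [(240, 200, 60), (0, 0, 0)],
]

def image_to_inputs(pixels):
    """Flatten rows of RGB pixels into one list of numbers between 0 and 1."""
    inputs = []
    for row in pixels:
        for red, green, blue in row:
            inputs.extend([red / 255, green / 255, blue / 255])
    return inputs

inputs = image_to_inputs(image)
print(len(inputs))   # 2 * 2 pixels * 3 colour values = 12
print(inputs[:3])    # the white pixel: [1.0, 1.0, 1.0]
```

A real 4000 x 3000 photo would go through exactly the same loop and produce a list of 36 million numbers.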

When we want to feed our network an image, we just set the layer of input neurons to be the colour data/pixels of the image we want to feed into it. The last layer of our network will be 3 neurons, each representing one of the snack types. How hard these neurons fire, a number between 0 and 1, will represent how confident the neural network is that the snack is of that type (i.e. a firing of 1 in the first neuron, representing pizza, will mean the system is 100% sure it’s an image of pizza).

But how does information move throughout the layers of the neural network?

Well, to find the activation of a neuron in the second layer, we take all the neuron activations in the previous layer and add them up, each one passing through its connection to that second-layer neuron. We give every connection a number indicating how the neuron in the first layer should be related to this new neuron in the second layer, which we call a weight.

If a neuron in the first layer is activated and the weight is positive, this indicates that the neuron in the second layer should also be activated. If the weight is negative, this indicates that the neuron in the second layer should be inactive. Of course, the weights will conflict with each other, but this gives us some numbers that we can tweak and fiddle around with, which is all we need.

$$w_1 a_1 + w_2 a_2 + w_3 a_3 + \dots + w_n a_n$$

Where the subscript is representative of the neuron number and coupled weight in the starting layer. Here, the values of each neuron ($a$) are multiplied by their corresponding weight ($w$), representative of how we might find a neuron in the second layer from neurons in the first.

The result of the weighted sum like this can be any number, but for this network we want the activation of the neuron to be values between 0 and 1 so we input it into a sigmoid function to squash it between 0 and 1 (and make it non-linear, but that is irrelevant right now).
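As a rough illustration of the squashing step, the sketch below assumes the usual logistic form of the sigmoid, 1 / (1 + e^(-x)). The function name `sigmoid` is simply a label chosen here.

```python
import math

def sigmoid(x):
    """Squash any real number into the range between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-x))

# Large negative sums land near 0, large positive sums land near 1,
# and a sum of exactly 0 lands in the middle at 0.5.
for weighted_sum in (-5.0, 0.0, 5.0):
    print(weighted_sum, sigmoid(weighted_sum))
```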

Now, it may sometimes be useful for a neuron to have a threshold, so that it stays quiet unless the weighted sum gets big enough. So we add a bias ($b$) to the weighted sum before squashing it. If the bias is negative, the weighted sum has to be larger than the size of that bias before the total becomes positive and the neuron meaningfully “fires” a signal; if the bias is positive, the neuron fires more readily, even when the weighted sum is small.

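All together we can represent our knowledge right now like this:

$$a^{(2)}_1 = \sigma\left(w_{1,1}\, a^{(1)}_1 + w_{1,2}\, a^{(1)}_2 + \dots + w_{1,n}\, a^{(1)}_n + b_1\right)$$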

Where a is a neuron: where the superscript is the layer number, and the subscript is the neuron number in that layer.

Where w is a weight: where the first subscript represents the neuron number in the current layer, and the second subscript represents the neuron number in the previous layer the weight is connected from.

This is the formula to calculate the first neuron in the second layer from the neurons of the previous layer, according to their weighted connections with this neuron, along with its bias.
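The single-neuron calculation described here (weighted sum, plus bias, through the sigmoid) can be sketched in a few lines of Python. The activations, weights and bias below are made-up numbers, and `neuron_activation` is a name chosen just for this example.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def neuron_activation(weights, previous_activations, bias):
    """Weighted sum of the previous layer, plus a bias, squashed by the sigmoid."""
    weighted_sum = sum(w * a for w, a in zip(weights, previous_activations))
    return sigmoid(weighted_sum + bias)

previous_layer = [0.9, 0.1, 0.5]   # made-up activations in the first layer
weights = [2.0, -1.0, 0.5]         # made-up weights into one second-layer neuron
bias = -1.0                        # made-up bias for that neuron

activation = neuron_activation(weights, previous_layer, bias)
print(activation)                  # some number between 0 and 1
```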

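We’re not going to compute each weighted sum and neuron one at a time. Instead, we can use matrix multiplication to calculate the activations of every neuron in the next layer all at once.

Using the previous formula we derived:

$$a^{(2)}_1 = \sigma\left(w_{1,1}\, a^{(1)}_1 + w_{1,2}\, a^{(1)}_2 + \dots + w_{1,n}\, a^{(1)}_n + b_1\right)$$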

First we wrap up all the activation values of all the neurons from the first layer into a column vector. You can think of this vector as representing the input layer, containing each neuron and its value.

$$a^{(1)} = \begin{bmatrix} a^{(1)}_1 \\ a^{(1)}_2 \\ \vdots \\ a^{(1)}_n \end{bmatrix}$$

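Then we organize all the weights as a matrix $W$, where each row contains the weight of every connection between the neurons in the first layer and one particular neuron in the next layer (the first row contains all the weights for the neuron $a^{(2)}_1$ we looked at before, the second row has all the weights for $a^{(2)}_2$, and so on). You can think of this as representing all of the different lines/connections between neurons in the first (input) layer and the second layer, each of which carries a weight.

$$W = \begin{bmatrix} w_{1,1} & w_{1,2} & \cdots & w_{1,n} \\ w_{2,1} & w_{2,2} & \cdots & w_{2,n} \\ \vdots & \vdots & \ddots & \vdots \\ w_{32,1} & w_{32,2} & \cdots & w_{32,n} \end{bmatrix}$$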

Using matrix multiplication, we can calculate the product $W a^{(1)}$ as a column vector, where each entry is the weighted sum for one neuron in the next layer (every first-layer activation combined with its corresponding weight). At this point the numbers do not yet have an activation threshold (the bias has not been added), nor are they between 0 and 1 (they have not yet been passed through the sigmoid).

Before we conclude that we’ve calculated the next layer however, we need to first add the biases, as a column vector $b$, to make sure we have our activation thresholds:

$$W a^{(1)} + b$$

Finally, wrap it up in a sigmoid ($\sigma$) to squish the final column vector output so each result is between 0 and 1:

$$\sigma\left(W a^{(1)} + b\right)$$

Not only do we add bias to a single neuron (say, the 32nd) and squish it down between 0 and 1, but we apply bias and sigmoid squashing to every single neuron in the second layer, which gives us the following expression:

$$a^{(2)} = \sigma\left(W a^{(1)} + b\right)$$

This is a small but powerful expression that represents calculating every neuron in the second layer based on all the neurons in the input layer, including accounting for our chosen weights and biases.

This expression can actually be used to calculate between any two chosen layers, including in between hidden layers and in between the final hidden layer and our output layer. Using $a^{(4)} = \sigma\left(W a^{(3)} + b\right)$ for example (with the weights and biases that belong to that pair of layers) would get us the final output layer.

So if we use this little expression multiple times, we can finally give a mathematical expression for how to calculate the output layer from the input layer.
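Applying "sigmoid of (weights times activations, plus biases)" once per layer can be sketched in plain Python as below. This is only an illustration under stated assumptions: the layer sizes are shrunk to 12, 4, 4 and 3 so that it runs instantly, the weights and biases are random, and the names `layer_forward` and `random_layer` are invented here.

```python
import math
import random

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def layer_forward(weight_rows, activations, biases):
    """Compute sigmoid(W a + b); each row of weights belongs to one next-layer neuron."""
    return [
        sigmoid(sum(w * a for w, a in zip(row, activations)) + b)
        for row, b in zip(weight_rows, biases)
    ]

def random_layer(n_in, n_out, rng):
    """Random weights and biases, standing in for an untrained network."""
    weights = [[rng.uniform(-1, 1) for _ in range(n_in)] for _ in range(n_out)]
    biases = [rng.uniform(-1, 1) for _ in range(n_out)]
    return weights, biases

rng = random.Random(0)
sizes = [12, 4, 4, 3]   # tiny stand-in for 36 million -> 32 -> 32 -> 3
layers = [random_layer(n_in, n_out, rng) for n_in, n_out in zip(sizes, sizes[1:])]

activations = [rng.random() for _ in range(sizes[0])]   # pretend input image
for weights, biases in layers:
    activations = layer_forward(weights, activations, biases)

print(activations)   # three numbers between 0 and 1: pizza, chocolate, crisps
```

Because the weights are random, the three output numbers carry no meaning yet.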

Here's a way to visualise the neural network we’ve discussed so far:

Now you know the mathematics behind neural networks and how they are structured. Each neuron is basically just a function: it takes all the numbers from the previous layer of neurons, applies a weighting and a bias, and then fires a number between 0 and 1. In fact, even though the network is stupidly complex, the whole network is a function. The first layer of the neural network is just a basic input (an image of pizza, crisps or chocolate represented mathematically), and the final layer is a simple output (which should tell us how confident the machine is that the image is each type of snack); we’ll get to how it “learns” what pizza looks like. It’s just an incredibly complex function. In fact, not only are our inputs changeable, but so are our weights and biases, remember? So, every single connection is its own little variable we can change. That’s (36 million * 32) + (32 * 32) + (32 * 3) weights, plus 32 + 32 + 3 biases, which comes to a ridiculous 1,152,001,187 (1 billion+) parameters we can tweak and fiddle around with.
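The parameter count can be checked with a few lines of Python, using the layer sizes from this example (36 million inputs, two hidden layers of 32, and 3 outputs):

```python
layer_sizes = [4000 * 3000 * 3, 32, 32, 3]

# One weight per connection between neighbouring layers.
weights = sum(n_in * n_out for n_in, n_out in zip(layer_sizes, layer_sizes[1:]))
# One bias per neuron in every layer except the input layer.
biases = sum(layer_sizes[1:])

print(weights)           # 1152001120
print(biases)            # 67
print(weights + biases)  # 1152001187
```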

Section 2: What is learning for our neural network? Explaining gradient descent and the learning process abstractly.

We want our output from the neural network to be the pizza output if the input is an image of pizza. But right now, when we input our image with a bunch of random weights and biases, the output is just going to be a completely random activation of the final three, aka, complete trash.

We already have the machine part, but here’s where we need the learning of “machine learning”. What do we mean by "learning?" Well, we never actually write any instructions for our cute little neural network, so we never build any pizza-chocolate-crisps recognising algorithm. That isn’t how it learns; we don’t tell it what to do. Instead, we are going to write a mathematical algorithm that will take in a bunch of example images of pizza, chocolate and crisps along with labels for what they are (i.e. a picture of pizza is pre-labelled pizza) and then adjust the weights and biases of our network to make it perform better. So in this way we “teach” the model and it “learns.”

However, although this might seem like magic, once we get down and dirty with the maths, it will probably feel a little less fantasy and a little more like a calculus exercise. In the end, what we mean by learning is actually just as simple as finding a minimum of a certain function.

But first of all, how do we communicate with the computer what we want it to give us and what we think is literally hot garbage? The way we do this is through a “cost function,” which represents how incorrect the model’s output is.

To represent this more mathematically, what we do is add up the squares (so every error counts as a positive number) of the differences between each of the garbage output activations and the values we want them to have. For example, if I stick in a picture of pizza, and it results in Pizza = 0.2, Chocolate = 0.8, Crisps = 0.1, and I want Pizza = 1, Chocolate = 0, Crisps = 0 (which represents 100% certainty of pizza and 100% certainty of NOT chocolate or crisps), then my “cost” would be $(0.2 - 1)^2 + (0.8 - 0)^2 + (0.1 - 0)^2 = 1.29$. But we’re interested in how a specific neural network (a specific set of weights and biases) performs on all types of images, not just one specific image of pizza. So we test it on a few hundred or a few thousand more images and take the average cost. That average tells us how garbage or useful our model is.
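The cost calculation above can be written as a short Python sketch. The first example uses the numbers from the text; the two extra outputs in `batch` are invented purely to show the averaging step, and the function names are labels chosen here.

```python
def cost(outputs, desired):
    """Sum of squared differences between what we got and what we wanted."""
    return sum((o - d) ** 2 for o, d in zip(outputs, desired))

def average_cost(examples):
    """Average the cost over a list of (outputs, desired) pairs."""
    return sum(cost(outputs, desired) for outputs, desired in examples) / len(examples)

# The pizza example from the text: order is pizza, chocolate, crisps.
single = cost([0.2, 0.8, 0.1], [1, 0, 0])
print(round(single, 2))   # 1.29

# Made-up outputs for two more images, just to illustrate the average.
batch = [
    ([0.2, 0.8, 0.1], [1, 0, 0]),
    ([0.1, 0.7, 0.3], [0, 1, 0]),
    ([0.3, 0.2, 0.9], [0, 0, 1]),
]
print(average_cost(batch))
```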

Back to the matter at hand, which is learning. We want to tweak the huge number of variables so that the network tells us what we want to hear as often as possible, i.e. “this is pizza” when it is pizza. We now have a way to mathematically represent what we don’t want, and therefore what we want (the minimum of what we don’t want). Imagine plotting the average cost for every possible choice of weights and biases on a graph, each point representing the performance of one neural network and therefore its corresponding weights and biases: to find the best neural net that classifies images for us, we just have to pinpoint the minimum on that graph.

The intuition of a calculus student here is probably going to be: “just solve for when the slope is zero on the cost function, problem solved, we all go home… right?” Unfortunately, it’s not so simple. The intuition is right, but with our neural nets you can't find the minimum explicitly by solving for when the slope is zero, because it's a really complicated function with a huge number of parameters. Instead, we use something called “gradient descent” to find a minimum.

Gradient descent is just what it says on the tin. If we want to find a minimum, it seems logical to just go lower and lower and “descend” down the gradient/slope of our average cost graph. If you imagine a normal graph, one way to find a low point is by “rolling” a ball down a slope, where gravity will roll it into a valley. Now, on a graph with many peaks and valleys there will be multiple minima, so we won’t necessarily find the lowest one. Although there are ways to look for lower and lower ones, a local minimum will suffice for our purposes.

So how do we “roll” down the slope? Well, we just nudge the weights and biases of the neural network a little and see how that affects the output of the cost function (how “correct” the network is). Imagine we push the ball right a little bit: if the ball goes down, we keep pushing it right; if the ball goes up, we push it left a little bit instead; if the ball goes up on both sides, we’ve found a valley (a local minimum).
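The "push it a little and see whether it went down" idea can be sketched in Python for a single variable. The function `valley` below is a made-up stand-in for a cost function (it has its lowest point at x = 2), and the step size of 0.01 is an arbitrary choice for illustration.

```python
def valley(x):
    """A made-up one-variable 'cost' whose lowest point is at x = 2."""
    return (x - 2) ** 2 + 1

x = -3.0      # where we drop the ball
step = 0.01   # how far each little push moves it (arbitrary choice)

while True:
    if valley(x + step) < valley(x):
        x += step    # pushing right went downhill, so keep going right
    elif valley(x - step) < valley(x):
        x -= step    # pushing right went uphill, so push left instead
    else:
        break        # uphill on both sides: we are in a valley

print(x)   # ends up within one step of 2
```

This trial-and-error version only works comfortably with one variable; the rest of this section is about doing the same thing with a great many variables at once.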

However, our function can’t be represented this simply in 2-dimensional space. Moving up to 3 dimensions means exactly the same thing, rolling the ball down to a local minimum; the peaks and valleys just look more like a topographic map than a 2D graph. That is fine conceptually, we just need a little more math to represent what we’re thinking about when we roll a ball down some hills. It no longer really makes sense to talk about slope as a single number. What we need instead is a vector, which can contain multiple values, and the standard in this 3D space is to use that vector to represent the direction of steepest ascent and call it the gradient. Imagine you’re flying above some hills: the gradient says how people on the ground should move in x and y to climb the most. So if the gradient is the uphill direction, the downhill direction will be the negative gradient. Don’t worry if you’re slightly confused, this is multivariable calculus, but all that really matters is that we can find this vector, and that it tells you both where the “downhill” direction is and how steeply it goes down. Remember, this space represents our neural network’s possible weights and biases, so going down means a better performing perceptron.

Now, instead of 1 input and 1 output (x, y: aka 2-dimensional space), or 2 inputs and 1 output (x, y, z: aka 3-dimensional space), we’re going to have a lot more things we can change: all 1 billion+ parameters are our inputs, and the cost function is the output again. Don’t worry if this sounds scary, as obviously it’s going to be impossible to visualise something like this. Here’s a better way of thinking about it in high-dimensional space: remember how we have all those 1,152,001,187 different variables to tweak? Represent those as a column vector with all their numbers, and then think about what the negative gradient would be. It would just be a column vector, equal in size, with each number in it corresponding to one of our variables, telling us how we should change that variable to “go downhill” (which is again making a better neural network). For example, the negative gradient of the cost function might look like:

So by now all the relevant ideas about the cost function and its purpose can be summarised as follows: the negative gradient of the cost function is an n-dimensional vector that tells us how to nudge all the weights and biases to decrease the cost function and therefore improve our neural network. That should hopefully make sense with all of the relevant terminology explained previously.
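To show what "nudging every variable along the negative gradient" looks like, here is a small Python sketch. It is not our real cost function: `toy_cost` is a made-up function of just two "parameters" whose gradient is easy to write down by hand, and the `learning_rate` (how big each nudge is) is an assumption introduced for this example.

```python
def gradient_step(parameters, gradient, learning_rate):
    """Nudge every parameter a little in the negative-gradient direction."""
    return [p - learning_rate * g for p, g in zip(parameters, gradient)]

# Made-up cost with two parameters: lowest at p0 = 1, p1 = -2.
def toy_cost(p):
    return (p[0] - 1) ** 2 + (p[1] + 2) ** 2

def toy_gradient(p):
    """The uphill direction for toy_cost, one entry per parameter."""
    return [2 * (p[0] - 1), 2 * (p[1] + 2)]

params = [5.0, 5.0]   # arbitrary starting point
for _ in range(100):
    params = gradient_step(params, toy_gradient(params), learning_rate=0.1)

print(params)             # very close to [1, -2]
print(toy_cost(params))   # very close to 0
```

With a billion parameters the update line is identical; the hard part is getting the gradient, which is what backpropagation is for.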

If you get that, then you can understand that calculating this gradient vector is how we’re going to figure out how to improve our neural network and, finally, get a piece of maths that can recognise and classify images (like my beloved snacks at Johnsons). The way we calculate this negative gradient (efficiently) is through “backpropagation,” which is some nifty calculus we’re going to explore in the next section.

Section 3: So how does our neural network learn? Explaining gradient descent through backpropagation, and then quantitatively describing backpropagation calculus.

First of all, before we dive into the maths, I want to give you some conceptual understanding of backpropagation with complete disregard for notation, which gets complex if we don’t go through the intuitive aspect of it first. Remember, the negative gradient of the cost function is an n-dimensional vector that tells us how to nudge all the weights and biases to decrease the cost function and therefore improve our neural network. Backpropagation is simply an algorithm which calculates that negative gradient.

SINGLE EXAMPLE

Right now let’s think about a single example of “training” our model: we provide our model with a picture of pizza and tell it that we want it to answer pizza, by basing the cost function on the difference between 1, 0, 0 (remember pizza is our first neuron) and whatever numbers it gives us from its completely trash random weights and biases. We obviously want to tweak the weights and biases so that the output correctly lights up with a 1, 0, 0.

Although we can’t control the activations in any neuronal layer directly, we can keep track of the changes we want to be made and then think about how we can bring them about. For example, if our random model gives us 0.1, 0.9, 0.1 and we want 1, 0, 0, we want a big positive push for our first neuron, a big negative push for our second and a small negative push for our third. Since we want the network to classify this as a pizza, we want the pizza neuron’s value to be nudged up while the other two get nudged down.

Let’s get into the specifics now. Let’s say we want to nudge up the activation of the pizza neuron from 0.1 to 1. Remember the activation value is simply the weighted sum of activations from the previous layer plus a bias (and then passed through a sigmoid squashing, but that’s irrelevant right now):

$$a = \sigma\left(w_1 a_1 + w_2 a_2 + \dots + w_n a_n + b\right)$$

We can see from this equation 3 different ways of increasing our neuron value: Increasing the bias, increasing the weights, or changing the activations of the neurons from the previous layer.

The bias is the most straightforward way to do this, where we just increase the bias associated with the pizza neuron (and decrease the biases associated with the other two neurons).

The weights are slightly more complex, as they are multiplied by the activations from the previous layer, so changing each one by the same amount will have different effects. Because we care about the efficiency of our changes, we focus on increasing the weights between our pizza neuron and the most active (highest value) neurons of the previous layer.

The final way we can increase the pizza neuron’s activation is by changing the activations of the neurons in the previous layer. Specifically, we want to increase the neurons that are connected to our pizza neuron with a positive weight, and decrease the neurons that are connected to our pizza neuron with a negative weight. Again, we cannot directly change these, but we can keep track of the changes we would like to make in a column vector, and that will be useful in just a second.

Remember, however, that these changes are only what our pizza neuron wants in order to become more active. We also need to take into consideration what the other neurons need in order to become less active. Each of those other output neurons has its own column vector of requests for what should happen to the second-to-last layer. So we add together all three column vectors of nudges: the ones that would increase the pizza neuron and the ones that would lower the chocolate and crisps neurons. There will be competing desires, and it is impossible to satisfy them all perfectly, but adding them up gives us an overall desired change that trains the model to identify pizza and to identify that it’s not crisps or chocolate.
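As a rough sketch of adding up the three request vectors, the Python below treats each output neuron's request to a previous-layer neuron as "connecting weight times the change that output neuron wants". That is a deliberate simplification for intuition only (it ignores the sigmoid's slope and constant factors), and every number and name in it is made up.

```python
# Made-up weights from a 4-neuron second-to-last layer into the 3 output neurons.
weights = {
    "pizza":     [0.8, -0.5, 0.1, 0.3],
    "chocolate": [-0.2, 0.9, 0.4, -0.6],
    "crisps":    [0.1, 0.2, -0.7, 0.5],
}
outputs = {"pizza": 0.1, "chocolate": 0.9, "crisps": 0.1}   # what the network said
desired = {"pizza": 1.0, "chocolate": 0.0, "crisps": 0.0}   # what we wanted

def requests_for(neuron):
    """What one output neuron would like each previous-layer activation to do.

    Simplified: request = weight * wanted change (sigmoid slope ignored).
    """
    wanted_change = desired[neuron] - outputs[neuron]
    return [w * wanted_change for w in weights[neuron]]

# Add the three request vectors together, entry by entry.
combined = [sum(parts) for parts in zip(*(requests_for(n) for n in weights))]
print(combined)   # positive entries: "please be more active"; negative: "less"
```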

Here is finally where we get the idea of propagating backwards. When we add all the desired effects, we can get a list of changes we want to happen in the second to last layer. Then, we can apply the exact same process to the relevant weights and biases determining those values, and iteratively repeat this process as we move backwards through our neural network.

REPEATING FOR ALL TRAINING EXAMPLES

Everything I just talked about was an example of how our neural network learns from a single training example: a pizza image with its associated desired output (which is what we mean when we say it is “labelled”). That single example results in a particular way we want to tweak our huge number of weights and biases. This is important to note, because so far we are only identifying the tweaks to the weights and biases that would improve our results for identifying pizza. But remember, we also need to account for how images of chocolate and crisps want to nudge the weights and biases, so we can identify those as well. If we only listened to the pizza example, the network would be incentivised to simply tweak the weights and biases so that it only ever fired the pizza neuron! Obviously, that’s not what we want here.

We need to zoom out a bit here and repeat this same backpropagation for all the other training examples (i.e. multiple different pictures of pizza, chocolate and crisps), every time recording how each example would like to change the weights and biases to improve recognition in that example’s case. Then we average together all the examples’ desired changes.

As we can see in the diagram, each training example has its own desire for how the weights and biases should be adjusted, and with relative strengths. By averaging together the desires of all the different training examples, we get the final result for how any weight or bias should be tweaked in a single gradient descent step.
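The averaging step can be sketched in Python as follows. Each row holds the nudges one training example asks for, with one entry per weight or bias; the three rows and their numbers are made up for illustration, and a real network would have over a billion entries per row.

```python
# Each row: the nudges one training example asks for (made-up numbers).
requested_nudges = [
    [0.30, -0.10, 0.05],    # what a pizza photo wants
    [-0.20, 0.40, 0.00],    # what a chocolate photo wants
    [0.10, 0.00, -0.35],    # what a crisps photo wants
]

def average_nudges(rows):
    """Average the requests column by column, one result per weight or bias."""
    count = len(rows)
    return [sum(column) / count for column in zip(*rows)]

step = average_nudges(requested_nudges)
print(step)   # the change applied in a single gradient descent step
```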

If we collect all these averaged desired tweaks for every weight and bias into a column vector of the same size, we finally have the negative gradient of the cost function! Though we have not yet quantitatively described these nudges, if you followed along with the steps above (i.e. why some nudges are bigger than others, and how they should all be added together) then you understand what backpropagation is actually doing. Now that you have an intuitive understanding of the backpropagation algorithm, it’s time to roll up our sleeves and get down and dirty with the formal math and the relevant calculus.

BACKPROPAGATION CALCULUS

First, there’s quite a bit of notation involved, and I want everything to be clear and easy to follow. Up until now I’ve simply said “negative gradient of the cost function” but that’s pretty wordy for our formulas. So instead I’ll use -∇C, where - means negative (obviously), ∇ just means gradient and C means the cost function. Backpropagation is an algorithm for calculating the gradient of the cost function of a network (∇C; we can then take its negative or use it however we like).

What we want to do is show how things like the chain rule and some differentiation can be utilised in the context of neural networks to get us the results we want.

Let’s imagine an extremely simple neural network, likewise with an input layer, an output layer and two hidden layers, but this time with just a single neuron in each.

This network is determined by 3 weights and 3 biases: one weight for each connection, and one bias for each neuron other than the input. Our goal here is to understand how changing the different weights and biases affects the cost function. If we know that, we can know which little tweaks will cause the most efficient decrease to the cost.

We’ll get to answering that soon enough. For now, though, let’s just focus on the connection between the last two neurons. Let’s label the activation of the output neuron $a^{(O)}$, with a superscript O (for output) that tells us what layer it’s in, so the activation of the previous neuron is $a^{(O-1)}$ and so on.

Just for clarity, these superscripts are not exponents at all, don’t get confused! They are just a helpful way of indexing which layer we’re talking about, as subscripts normally refer to something else that we’ll talk about later.

Now a little more clarification of notation. If we have a desired output ($y$), the cost for this one training example ($C_0$) will be $\left(a^{(O)} - y\right)^2$ as, remember, the cost function is the square of the difference between the output and what we want.

Let me remind you of something really quickly. This last activation is made up of a weight, a bias and the previous neuron’s activation, all put through a sigmoid:

$$a^{(O)} = \sigma\left(w^{(O)} a^{(O-1)} + b^{(O)}\right)$$

For the sake of simplification, it’s a lot easier if we assign this weighted sum to a variable, $z^{(O)}$:

$$z^{(O)} = w^{(O)} a^{(O-1)} + b^{(O)}, \qquad a^{(O)} = \sigma\left(z^{(O)}\right)$$

Here’s a nice way of visualising it. The weight $w^{(O)}$, the bias $b^{(O)}$, and the previous layer’s activation $a^{(O-1)}$ determine the weighted sum $z^{(O)}$, which in turn is transformed into $a^{(O)}$, and therefore lets us calculate $C_0$ (along with our $y$), as that’s how our cost function works.

And just to keep up our intuitions about how these networks work, remember $a^{(O-1)}$ is influenced by its own weight and bias and previous activation, so our tree extends backwards a bit more too -

- but we won’t think about that right now.

Remember, all of these are just numbers, so don’t be intimidated at all. One nice way to think about it is each having a little slider that can move freely up and down to change the numbers. Think about what would happen if we dragged $w^{(O)}$ up from 1 to 2 for example. Well, it depends on the values of $a^{(O-1)}$ and $b^{(O)}$, but assuming both are positive, it should double the input from $a^{(O-1)}$, thus increasing $z^{(O)}$, thus increasing $a^{(O)}$, and if we are below $y$ it should decrease our cost, or if we are above $y$ it should increase our cost. But this is an abstract way of doing it: excitingly, we can actually calculate how changes in $w^{(O)}$ impact $C_0$!
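The slider picture can be tried out directly in Python. The sketch below computes the weighted sum, the output activation and the cost for the last link of the chain, first with the weight at 1 and then at 2. The values chosen for the bias, the previous activation and the desired output are made up, and `last_link` is a name invented for this example.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def last_link(w, b, a_prev, y):
    """Weight, bias and previous activation give z, then a, then the cost."""
    z = w * a_prev + b     # weighted sum
    a = sigmoid(z)         # output activation
    cost = (a - y) ** 2    # squared difference from the desired output
    return z, a, cost

# Made-up values: bias 0.5, previous activation 0.6, desired output 1.
before = last_link(w=1.0, b=0.5, a_prev=0.6, y=1.0)
after = last_link(w=2.0, b=0.5, a_prev=0.6, y=1.0)

print(before)   # (z, a, cost) with the weight slider at 1
print(after)    # z and a go up; the output was below y, so the cost goes down
```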

THE WEIGHT CALC

We will now compute the first derivative we want. We want to know how sensitive $C_0$ is to changes in $w^{(O)}$ (how much $C_0$ responds to tweaks in the value of $w^{(O)}$). Mathematically speaking, we want to know the derivative:

$$\frac{\partial C_0}{\partial w^{(O)}}$$

When you see $\partial w^{(O)}$ just think of it as meaning “a very tiny change to $w^{(O)}$” like 0.001. Think about the $\partial C_0$ term as meaning “whatever the change to the cost is from that change.” What we want is how sensitive $C_0$ is to tweaks in $w^{(O)}$, meaning their resulting ratio.

Please don’t be intimidated by my use of partial differentiation here (denoted by $\partial$). Partial differentiation is what we do when we are dealing with multiple variables, as normal differentiation won’t work. It’s as simple as treating the other variables as constants and then differentiating. Now let’s move back to how we can find how much a shift in $w^{(O)}$ impacts $C_0$.

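Conceptually speaking, $w^{(O)}$ impacts the weighted sum $z^{(O)}$, thus impacting $a^{(O)}$, which directly influences the cost $C_0$. We need to work through this chain of events to calculate $\frac{\partial C_0}{\partial w^{(O)}}$. We break this chain of events down simply. First, we see how tweaks in $w^{(O)}$ impact $z^{(O)}$ by taking the ratio of the tiny change in $z^{(O)}$ to the tiny change in $w^{(O)}$; that is, we want the derivative of $z^{(O)}$ with respect to $w^{(O)}$. Similarly, we consider how sensitive $a^{(O)}$ is to tweaks in $z^{(O)}$ (meaning the ratio of the two again), and of course likewise the ratio between the final change to $C_0$ and that tweak to $a^{(O)}$. Altogether those sensitivities combine to give our overall sensitivity of $C_0$ to a shift in $w^{(O)}$. Mathematically expressed it is:

$$\frac{\partial C_0}{\partial w^{(O)}} = \frac{\partial z^{(O)}}{\partial w^{(O)}} \cdot \frac{\partial a^{(O)}}{\partial z^{(O)}} \cdot \frac{\partial C_0}{\partial a^{(O)}}$$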

That’s a lot of seemingly complex notation for a simple concept! But don’t worry, if you followed along to what I just said you already understand this!

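Remember, $\frac{\partial C_0}{\partial w^{(O)}}$ just means how sensitive $C_0$ is to a shift in $w^{(O)}$, and is what we need to find. To find this we need the sensitivity of the different moving parts in the chain of steps in between, right? Well, $\frac{\partial z^{(O)}}{\partial w^{(O)}}$ tells us how sensitive $z^{(O)}$ is to a shift in $w^{(O)}$, $\frac{\partial a^{(O)}}{\partial z^{(O)}}$ tells us how sensitive $a^{(O)}$ is to a shift in $z^{(O)}$, and finally $\frac{\partial C_0}{\partial a^{(O)}}$ tells us how sensitive $C_0$ is to a shift in $a^{(O)}$.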

Put together, all $\frac{\partial C_0}{\partial w^{(O)}} = \frac{\partial z^{(O)}}{\partial w^{(O)}} \cdot \frac{\partial a^{(O)}}{\partial z^{(O)}} \cdot \frac{\partial C_0}{\partial a^{(O)}}$ means is the sensitivity of the different moving parts in the chain of steps in between, and therefore, how sensitive $C_0$ is to a shift in $w^{(O)}$. Hopefully that should be easy to follow now. The astute will notice that what we’re actually using here is the chain rule. Now it’s as simple as 1, 2, 3 (differentiation steps to break it down into its constituent derivatives)!

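Now we’ve pieced apart $\frac{\partial C_0}{\partial w^{(O)}}$, the next step is to compute the values of the derivatives that comprise it. We’ll use some formulas we’ve already talked about in our definitions of our neural networks:

$$C_0 = \left(a^{(O)} - y\right)^2, \qquad z^{(O)} = w^{(O)} a^{(O-1)} + b^{(O)}, \qquad a^{(O)} = \sigma\left(z^{(O)}\right)$$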

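For now let’s deal with the first derivative: $z^{(O)} = w^{(O)} a^{(O-1)} + b^{(O)}$ with respect to $w^{(O)}$ ($\frac{\partial z^{(O)}}{\partial w^{(O)}}$). Treating $b^{(O)}$ as a constant means it just disappears (its derivative with respect to $w^{(O)}$ is 0). For the term $w^{(O)} a^{(O-1)}$ we can use the product rule:

$$\frac{\partial (uv)}{\partial w^{(O)}} = \frac{\partial u}{\partial w^{(O)}}\, v + u\, \frac{\partial v}{\partial w^{(O)}}$$

where $u = w^{(O)}$ and $v = a^{(O-1)}$, and every derivative is taken with respect to $w^{(O)}$ because that is the variable we’re taking the partial derivative with respect to. So, we have

$$\frac{\partial \left(w^{(O)} a^{(O-1)}\right)}{\partial w^{(O)}} = \frac{\partial w^{(O)}}{\partial w^{(O)}}\, a^{(O-1)} + w^{(O)}\, \frac{\partial a^{(O-1)}}{\partial w^{(O)}}$$

And as $a^{(O-1)}$ is treated as a constant we can say

$$\frac{\partial a^{(O-1)}}{\partial w^{(O)}} = 0$$

And as the derivative of a variable with respect to itself is 1 we can also say

$$\frac{\partial w^{(O)}}{\partial w^{(O)}} = 1$$

And thus:

$$\frac{\partial \left(w^{(O)} a^{(O-1)}\right)}{\partial w^{(O)}} = 1 \cdot a^{(O-1)} + w^{(O)} \cdot 0 = a^{(O-1)}$$

Which finally all together tells us that

$$\frac{\partial z^{(O)}}{\partial w^{(O)}} = a^{(O-1)}$$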

This is the first of the three constituent derivatives we want, and also the hardest one to calculate (even though it’s actually quite simple).

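The next two are as simple as

$$\frac{\partial a^{(O)}}{\partial z^{(O)}} = \sigma'\left(z^{(O)}\right)$$

and due to the power rule:

$$\frac{\partial C_0}{\partial a^{(O)}} = 2\left(a^{(O)} - y\right)$$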

Pause and take a moment to reflect on what these actually show. The first derivative tells us that changing the weight has a greater effect on the weighted sum $z^{(O)}$ when the activation from the previous layer that is connected to it is stronger. Also, the final derivative tells us that if the actual output we get is very different from what we want it to be, then small changes to the activation make proportionally larger differences to the cost.

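Now, putting this all together in the context of

$$\frac{\partial C_0}{\partial w^{(O)}} = \frac{\partial z^{(O)}}{\partial w^{(O)}} \cdot \frac{\partial a^{(O)}}{\partial z^{(O)}} \cdot \frac{\partial C_0}{\partial a^{(O)}}$$

we get that

$$\frac{\partial C_0}{\partial w^{(O)}} = a^{(O-1)} \cdot \sigma'\left(z^{(O)}\right) \cdot 2\left(a^{(O)} - y\right)$$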

The good news is that we did it! This formula tells us how a tweak in that one weight in the last layer will affect the cost! However, we have to remember, this is for that *one particular training example*. To get the entire gradient vector, we’re going to have to calculate a whole lot more.
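One way to convince yourself of this result is to compare it with the "tiny change" picture directly. The Python sketch below multiplies the three pieces together (previous activation, times the sigmoid's slope, times 2(a - y)) and checks the answer against the ratio you get from actually nudging the weight by a tiny amount. It assumes the logistic sigmoid, whose slope can be written as sigmoid(z) * (1 - sigmoid(z)); all the input numbers and function names are made up.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def sigmoid_slope(x):
    """Derivative of the logistic sigmoid: s * (1 - s)."""
    s = sigmoid(x)
    return s * (1.0 - s)

def cost(w, b, a_prev, y):
    return (sigmoid(w * a_prev + b) - y) ** 2

def dcost_dw(w, b, a_prev, y):
    """Chain rule: (dz/dw) * (da/dz) * (dC0/da)."""
    z = w * a_prev + b
    a = sigmoid(z)
    return a_prev * sigmoid_slope(z) * 2.0 * (a - y)

w, b, a_prev, y = 1.0, 0.5, 0.6, 1.0   # made-up values
tiny = 1e-6

# Ratio of "change in cost" to "tiny change in weight", measured directly.
numeric = (cost(w + tiny, b, a_prev, y) - cost(w - tiny, b, a_prev, y)) / (2 * tiny)
analytic = dcost_dw(w, b, a_prev, y)

print(analytic)
print(numeric)   # the two agree to many decimal places
```

The derivative comes out negative here, meaning that increasing this weight would lower the cost for this example.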

However, the rest of the work in this section at this point is literally just taking averages. So don’t despair. Now we figure out how to get the cost for all training data.

The astute will have noticed that $C_0$ has a little subscript, 0. That’s because it’s the cost of a single training example. The full cost function, $C$ (rather banally and obviously), is the average of all the individual costs of each training example:

$$C = \frac{1}{n} \sum_{k=0}^{n-1} C_k$$

Don’t worry about the $\sum$, that just means adding all the different $C_k$ (the individual costs of each training example) together; dividing by the number of costs, $n$, then gives the average. Likewise, the derivative of $C$ with respect to the weight is the average of all the individual derivatives.

$$\frac{\partial C}{\partial w^{(O)}} = \frac{1}{n} \sum_{k=0}^{n-1} \frac{\partial C_k}{\partial w^{(O)}}$$

Now this expression is the real deal – it tells us how the overall cost of our neural network will change when we tweak the last weight.
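As a sketch of this averaging for the last weight of the one-neuron-per-layer network, the Python below computes the derivative for each training example and then takes the mean. Each example is represented here as a pair (previous activation, desired output); the three examples, the weight and the bias are made-up numbers, and the logistic sigmoid is assumed.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def dcost_dw(w, b, a_prev, y):
    """Derivative of one example's cost (a - y)^2 with respect to the last weight."""
    z = w * a_prev + b
    a = sigmoid(z)
    return a_prev * a * (1.0 - a) * 2.0 * (a - y)

def average_dcost_dw(w, b, examples):
    """Average the per-example derivatives to get the full cost's derivative."""
    return sum(dcost_dw(w, b, a_prev, y) for a_prev, y in examples) / len(examples)

# Made-up training examples: (previous activation, desired output).
examples = [(0.6, 1.0), (0.2, 0.0), (0.9, 1.0)]

slope = average_dcost_dw(w=1.0, b=0.5, examples=examples)
print(slope)   # one entry of the gradient vector
```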

Now, if you recall, each entry of the gradient vector is a partial derivative of our cost function with respect to a weight or bias in the network (and the whole gradient vector collects the partial derivatives of the cost with respect to every single weight and bias). This means that $\frac{\partial C}{\partial w^{(O)}}$ is one of our entries in the gradient vector! Specifically, it is the entry for the last weight (the weight for the output layer).