10 min read

In this article by Gianmario Spacagna, Daniel Slater, Phuong Vo.T.H, and Valentino Zocca, the authors of the book Python Deep Learning, we will learnthe Backpropagation algorithmas it is one of the most important topics for multi-layer feed-forward neural networks.

(For more resources related to this topic, see here.)

Propagating the error back from last to first layer, hence the name Backpropagation. Backpropagation is one of the most difficult algorithms to understand at first, but all is needed is some knowledge of basic differential calculus and the chain rule. For a deep neural network the algorithm to set the weights is called the Backpropagation algorithm.

The Backpropagation algorithm

We have seen how neural networks can map inputs onto determined outputs, depending on fixed weights. Once the architecture of the neural network has been defined (feed-forward, number of hidden layers, number of neurons per layer), and once the activity function for each neuron has been chosen, we will need to set the weights that in turn will define the internal states for each neuron in the network. We will see how to do that for a 1-layer network and then how to extend it to a deep feed-forward network. For a deep neural network the algorithm to set the weights is called the Backpropagation algorithm, and we will discuss and explain this algorithm for most of this section as it is one of the most important topics for multilayer feed-forward neural networks. First, however, we will quickly discuss this for 1-layer neural networks. The general concept we need to understand is the following: every neural network is an approximation of a function, therefore each neural network will not be equal to the desired function, instead it will differ by some value. This value is called the error and the aim is to minimize this error. Since the error is a function of the weights in the neural network, we want to minimize the error with respect to the weights. The error function is a function of many weights, it is therefore a function of many variables. Mathematically, the set of points where this function is zero represents therefore a hypersurface and to find a minimum on this surface we want to pick a point and then follow a curve in the direction of the minimum.

Linear regression

To simplify things we are going to introduce matrix notation. Let x be the input, we can think of x as a vector. In the case of linear regression we are going to consider a single output neuron y, the set of weights w is therefore a vector of dimension the same as the dimension of x. The activation value is then defined as the inner product <x, w>.

Let’s say that for each input value x we want to output a target value t, while for each x the neural network will output a value y defined by the activity function chosen, in this case the absolute value of the difference (y-t) represents the difference between the predicted value and the actual value for the specific input example x. If we have m input values xi, each of them will have a target value ti. In this case we calculate the error using the mean squared error , where each yi is a function of w. The error is therefore a function of w and it is usually denoted with J(w).

As mentioned above, this represents a hypersurface of dimension equal to the dimension of w (we are implicitly also considering the bias), and for each wj we need to find a curve that will lead towards the minimum of the surface. The direction in which a curve increases in a certain direction is given by its derivative with respect to that direction, in this case by:

And in order to move towards the minimum we need to move in the opposite direction set by  for each wj.

Let’s calculate:

If , then  and therefore:

The notation can sometimes be confusing, especially the first time one sees it. The input is given by vectors xi, where the superscript indicated the ith example. Since x and w are vectors, the subscript indicates the jth coordinate of the vector. yi then represents the output of the neural network given the input xi, while ti represents the target, that is, the desired value corresponding to the input xi.

In order to move towards the minimum, we need to move each weight in the direction of its derivative by a small amount l, called the learning rate, typically much smaller than 1, (say 0.1 or smaller). We can therefore drop the 2 in the derivative and incorporate it in the learning rate, to get the update rule therefore given by:

or, more in general, we can write the update rule in matrix form as:

where ∇ represents the vector of partial derivatives. This process is what is often called gradient descent.

One last note, the update can be done after having calculated all the input vectors, however, in some cases, the weights could be updated after each example or after a defined preset number of examples.

Logistic regression

In logistic regression, the output is not continuous; rather it is defined as a set of classes. In this case, the activation function is not going to be the identity function like before, rather we are going to use the logistic sigmoid function. The logistic sigmoid function, as we have seen before, outputs a real value in (0,1) and therefore it can be interpreted as a probability function, and that is why it can work so well in a 2-class classification problem. In this case, the target can be one of two classes, and the output represents the probability that it be one of those two classes (say t=1).Let’s denote with σ(a), with a the activation value,the logistic sigmoid function, therefore, for each examplex, the probability that the output be the class y, given the weights w, is:

We can write that equation more succinctly as:

and, since for each sample xi the probabilities are independent, we have that the global probability is:

If we take the natural log of the above equation (to turn products into sums), we get:

The object is now to maximize this log to obtain the highest probability of predicting the correct results. Usually, this is obtained, as in the previous case, by using gradient descent to minimize the cost function defined by.

As before, we calculate the derivative of the cost function with respect to the weights wj to obtain:

In general, in case of a multi-class output t, with t a vector (t1, …, tn), we can generalize this equation using J (w) = −log(P( y x,w))= Ei,ti j, log ( (di)) that brings to the update equation for the weights:

This is similar to the update rule we have seen for linear regression.


In the case of 1-layer, weight-adjustment was easy, as we could use linear or logistic regression and adjust the weights simultaneously to get a smaller error (minimizing the cost function). For multi-layer neural networks we can use a similar argument for the weights used to connect the last hidden layer to the output layer, as we know what we would like the output layer to be, but we cannot do the same for the hidden layers, as, a priori, we do not know what the values for the neurons in the hidden layers ought to be. What we do, instead, is calculate the error in the last hidden layer and estimate what it would be in the previous layer, propagating the error back from last to first layer, hence the name Backpropagation. Backpropagation is one of the most difficult algorithms to understand at first, but all is needed is some knowledge of basic differential calculus and the chain rule. Let’s introduce some notation first. We denote with Jthe cost (error), with y the activity function that is defined on the activation value a (for example y could be the logistic sigmoid), which is a function of the weights w and the input x. Let’s also define wi,j the weight between the ith input value and the jth output. Here we define input and output more generically than for 1-layer network, if wi,j connects a pair of successive layers in a feed-forward network, we denote as input the neurons on the first of the two successive layers, and output the neurons on the second of the two successive layers. In order not to make the notation too heavy, and have to denote on which layer each neuron is, we assume that the ith input yi is always in the layer preceding the layer containing the jth output yj The letter y is used to both denote an input and the activity function, and we can easily infer which one we mean by the contest. We also use subscripts and jwhere we always have ibelonging to the layer preceding the layer containing the element with subscript j. We also use subscripts i and j, where we always have the element with subscript i belonging to the layer preceding the layer containing the element with subscript j.

In this example, layer 1 represents the input, and layer 2 the output

Using this notation, and the chain-rule for derivatives, for the last layer of our neural network we can write:

Since we know that , we have:

If y is the logistic sigmoid defined above, we get the same result we have already calculated at the end of the previous section, since we know the cost function and we can calculate all derivatives.

For the previous layers the same formula holds:

Since we know that and we know that  is the derivative of the activity function that we can calculate, all we need to calculate is the derivative . Let’s notice that this is the derivative of the error with respect to the activation function in the second layer, and, if we can calculate this derivative for the last layer, and have a formula that allows us to calculate the derivative for one layer assuming we can calculate the derivative for the next, we can calculate all the derivatives starting from the last layer and move backwards.

Let us notice that, as we defined the yj, they are the activation values for the neurons in the second layer, but they are also the activity functions, therefore functions of the activation values in the first layer. Therefore, applying the chain rule, we have:

and once again we can calculate both and, so , once we knowwe can calculate, and since we can calculate for the last layer, we can move backward and calculate for any layer and therefore  for any layer.

Summarizing, if we have a sequence of layers where:

We then have these two fundamental equations, where the summation in the second equation should read as the sum over all the outgoing connections fromyj to any neuron yk in the successive layer:

By using these two equations we can calculate the derivatives for the cost with respect to each layer.

If we set ,   represents the variation of the cost with respect to the activation value, and we can think of as the error at the neuron yj. We can then rewrite as:

which implies that . These two equations give an alternate way of seeing Backpropagation, as the variation of the cost with respect to the activation value, and provide a formula to calculate this variation for any layer once we know the variation for the following layer:

We can also combine these equations and show that:

The Backpropagation algorithm for updating the weights is then given on each layer by:

In the last section we will provide a code example that will help understand and apply these concepts and formulas.


At the end of this articlewe learnt the post neural networks architecture phaseand the use of the Backpropagation algorithm and we saw see how we can stack many layers to create and use deep feed-forward neural networks, and how a neural network can have many layers, and why inner (hidden) layers are important.

Resources for Article:

Further resources on this subject:


Please enter your comment!
Please enter your name here