[box type=”note” align=”” class=”” width=””]*This article is an excerpt from a book written by Rajdeep Dua and Manpreet Singh Ghotra titled **Neural Network Programming with Tensorflow**. In this book, you will learn to leverage the power of TensorFlow to train neural networks of varying complexities, without any hassle.*[/box]

Today we will focus on the gradient descent algorithm and its different variants. We will take a simple example of linear regression to solve the optimization problem.

Gradient descent is the most successful optimization algorithm. As mentioned earlier, it is used to do weights updates in a neural network so that we minimize the loss function. Let’s now talk about an important neural network method called backpropagation, in which we firstly propagate forward and calculate the dot product of inputs with their corresponding weights, and then apply an activation function to the sum of products which transforms the input to an output and adds non linearities to the model, which enables the model to learn almost any arbitrary functional mappings.

Later, we back propagate in the neural network, carrying error terms and updating weights values using gradient descent, as shown in the following graph:

**Different variants of gradient descent**

**Standard gradient descent**, also known as **batch gradient descent**, will calculate the gradient of the whole dataset but will perform only one update. Therefore, it can be quite slow and tough to control for datasets which are extremely large and don’t fit in the memory. Let’s now look at algorithms that can solve this problem.

**Stochastic gradient descent** (**SGD**) performs parameter updates on each training example, whereas mini batch performs an update with n number of training examples in each batch. The issue with SGD is that, due to the frequent updates and fluctuations, it eventually complicates the convergence to the accurate minimum and will keep exceeding due to regular fluctuations. Mini-batch gradient descent comes to the rescue here, which reduces the variance in the parameter update, leading to a much better and stable convergence. SGD and mini-batch are used interchangeably.

Overall problems with gradient descent include choosing a proper learning rate so that we avoid slow convergence at small values, or divergence at larger values and applying the same learning rate to all parameter updates wherein if the data is sparse we might not want to update all of them to the same extent. Lastly, is dealing with saddle points.

**Algorithms to optimize gradient descent**

We will now be looking at various methods for optimizing gradient descent in order to calculate different learning rates for each parameter, calculate momentum, and prevent decaying learning rates.

To solve the problem of high variance oscillation of the SGD, a method called momentum was discovered; this accelerates the SGD by navigating along the appropriate direction and softening the oscillations in irrelevant directions. Basically, it adds a fraction of the update vector of the past step to the current update vector. Momentum value is usually set to .9. Momentum leads to a faster and stable convergence with reduced oscillations.

**Nesterov accelerated gradient** explains that as we reach the minima, that is, the lowest point on the curve, momentum is quite high and it doesn’t know to slow down at that point due to the large momentum which could cause it to miss the minima entirely and continue moving up. Nesterov proposed that we first make a long jump based on the previous momentum, then calculate the gradient and then make a correction which results in a parameter update. Now, this update prevents us to go too fast and not miss the minima, and makes it more responsive to changes.

**Adagrad** allows the learning rate to adapt based on the parameters. Therefore, it performs large updates for infrequent parameters and small updates for frequent parameters. Therefore, it is very well-suited for dealing with sparse data. The main flaw is that its learning rate is always decreasing and decaying. Problems with decaying learning rates are solved using AdaDelta.

**AdaDelta** solves the problem of decreasing learning rate in AdaGrad. In AdaGrad, the learning rate is computed as one divided by the sum of square roots. At each stage, we add another square root to the sum, which causes the denominator to decrease constantly. Now, instead of summing all prior square roots, it uses a sliding window which allows the sum to decrease.

**Adaptive Moment Estimation** (**Adam**) computes adaptive learning rates for each parameter. Like AdaDelta, Adam not only stores the decaying average of past squared gradients but additionally stores the momentum change for each parameter. Adam works well in practice and is one of the most used optimization methods today.

The following two images (image credit: Alec Radford) show the optimization behavior of optimization algorithms described earlier. We see their behavior on the contours of a loss surface over time. Adagrad, RMsprop, and Adadelta almost quickly head off in the right direction and converge fast, whereas momentum and NAG are headed off-track. NAG is soon able to correct its course due to its improved responsiveness by looking ahead and going to the minimum.

The second image displays the behavior of the algorithms at a saddle point. **SGD**, **Momentum**, and **NAG** find it challenging to break symmetry, but slowly they manage to escape the saddle point, whereas **Adagrad**, **Adadelta**, and **RMsprop** head down the negative slope, as can seen from the following image:

**Which optimizer to choose**

In the case that the input data is sparse or if we want fast convergence while training complex neural networks, we get the best results using adaptive learning rate methods. We also don’t need to tune the learning rate. For most cases, Adam is usually a good choice.

**Optimization with an example**

Let’s take an example of linear regression, where we try to find the best fit for a straight line through a number of data points by minimizing the squares of the distance from the line to each data point. This is why we call it least squares regression. Essentially, we are formulating the problem as an optimization problem, where we are trying to minimize a loss function.

Let’s set up input data and look at the scatter plot:

```
# input data
xData = np.arange(100, step=.1)
yData = xData + 20 * np.sin(xData/10)
```

Define the data size and batch size:

```
# define the data size and batch size
nSamples = 1000
batchSize = 100
```

We will need to resize the data to meet the TensorFlow input format, as follows:

```
# resize input for tensorflow
xData = np.reshape(xData, (nSamples, 1))
yData = np.reshape(yData, (nSamples, 1))
```

The following scope initializes the weights and bias, and describes the linear model and loss function:

```
with tf.variable_scope("linear-regression-pipeline"): W = tf.get_variable("weights", (1,1),
initializer=tf.random_normal_initializer())
b = tf.get_variable("bias", (1, ), initializer=tf.constant_initializer(0.0))
```

```
# model
yPred = tf.matmul(X, W) + b
# loss function
loss = tf.reduce_sum((y - yPred)**2/nSamples)
```

We then set optimizers for minimizing the loss:

```
# set the optimizer
#optimizer =
tf.train.GradientDescentOptimizer(learning_rate=0.001).minimize(loss)
#optimizer = tf.train.AdamOptimizer(learning_rate=.001).minimize(loss)
#optimizer = tf.train.AdadeltaOptimizer(learning_rate=.001).minimize(loss)
#optimizer = tf.train.AdagradOptimizer(learning_rate=.001).minimize(loss)
#optimizer = tf.train.MomentumOptimizer(learning_rate=.001,
momentum=0.9).minimize(loss)
#optimizer = tf.train.FtrlOptimizer(learning_rate=.001).minimize(loss)
optimizer = tf.train.RMSPropOptimizer(learning_rate=.001).minimize(loss)
We then select the mini batch and run the optimizers errors = []
with tf.Session() as sess:
# init variables
sess.run(tf.global_variables_initializer())
for _ in range(1000):
# select mini batch
indices = np.random.choice(nSamples, batchSize)
xBatch, yBatch = xData[indices], yData[indices]
# run optimizer
_, lossVal = sess.run([optimizer, loss], feed_dict={X: xBatch, y:
yBatch})
errors.append(lossVal)
plt.plot([np.mean(errors[i-50:i]) for i in range(len(errors))])
plt.show()
plt.savefig("errors.png")
```

The output of the preceding code is as follows:

We also get a sliding curve, as follows:

We learned optimization is a complicated subject and a lot depends on the nature and size of our data. Also, optimization depends on weight matrices. A lot of these optimizers are trained and tuned for tasks like image classification or predictions. However, for custom or new use cases, we need to perform trial and error to determine the best solution.

*To know more about how to build and optimize neural networks using TensorFlow, do checkout this book **Neural Network Programming with Tensorflow**.*