This article will take you through the history of Deep learning and how it has grown over time. It will walk you through some of the core concepts of Deep Learning like sigmoid activation, rectified linear unit(ReLU), etc.

## Evolution of deep learning

A lot of the important work on neural networks happened in the 80’s and in the 90’s, but back then computers were slow and datasets very tiny. The research didn’t really find many applications in the real world. As a result, in the first decade of the 21st century, neural networks have completely disappeared from the world of machine learning. It’s only in the last few years, first seeing speech recognition around 2009, and then in computer vision around 2012, that neural networks made a big comeback with (LeNet, AlexNet). What changed?

Lots of data (big data) and cheap, fast GPU’s. Today, neural networks are everywhere. So, if you’re doing anything with data, analytics, or prediction, deep learning is definitely something that you want to get familiar with.

See the following figure:

Deep learning is an exciting branch of machine learning that uses data, lots of data, to teach computers how to do things only humans were capable of before, such as recognizing what’s in an image, what people are saying when they are talking on their phone, translating a document into another language, helping robots explore the world and interact with it. Deep learning has emerged as a central tool to solve perception problems and it’s the state of the art with computer vision and speech recognition.

Today many companies have made deep learning a central part of their machine learning toolkit—Facebook, Baidu, Amazon, Microsoft, and Google are all using deep learning in their products because deep learning shines wherever there is lots of data and complex problems to solve.

Deep learning is the name we often use for “deep neural networks” composed of several layers. Each layer is made of nodes. The computation happens in the node, where it combines input data with a set of parameters or weights, that either amplify or dampen that input. These input-weight products are then summed and the sum is passed through activation function, to determine what extent the value should progress through the network to affect the final prediction, such as an act of classification. A layer consists of row of nodes that that turn on or off as the input is fed through the network based. The input of the first layer becomes the input of the second layer and so on. Here’s a diagram of what neural network might look like:

Let’s get familiarize with some deep neural network concepts and terminology.

**Sigmoid activation**

Sigmoid activation function used in neural network has an output boundary of (0, 1), and α is the offset parameter to set the value at which the sigmoid evaluates to 0.

Sigmoid function often works fine for gradient descent as long as input data x is kept within a limit. For large values of x, y is constant. Hence, the derivatives dy/dx (the gradient) equates to 0, which is often termed as the vanishing gradient problem.

This is a problem because when the gradient is 0, multiplying it with the loss (actual value – predicted value) also gives us 0 and ultimately networks stops learning.

**Rectified Linear Unit (ReLU)**

A neural network can be built from combining some linear classifier with some non-linear function. The Rectified Linear Unit (ReLU) has become very popular in the last few years. It computes the function f(x) = max(0,x) f(x)=max(0,x). In other words, the activation is simply thresholded at zero. Unfortunately, ReLU units can be fragile during training and can die as a ReLU neuron could cause the weights to update in such a way that the neuron will never activate on any datapoint again and so the gradient flowing through the unit will forever be zero from that point on.

To overcome this problem a leaky ReLU function will have a small negative slope (of 0.01, or so) instead of zero when x

f(x)= (x (x>=0)(x)f(x)=1(x=0)(x) where αα is a small constant.

**Exponential Linear Unit (ELU)**

The mean of ReLU activation is not zero and hence sometime makes the learning difficult for the network. Exponential Linear Unit (ELU) is similar to ReLU activation function when input x is positive, but for negative values it is a function bounded by a fixed value -1, for α=1 (hyperparameter α controls the value to which an ELU saturates for negative inputs). This behavior helps to push the mean activation of neurons closer to zero, that helps to learn representations that are more robust to noise.

**Stochastic Gradient Descent (SGD)**

Scaling batch gradient descent is cumbersome because it has to compute a lot if the dataset is big and as a rule of thumb. If computing your loss takes n floating point operations, computing its gradient takes about three times that compute.

But in practice we want to be able to train lots of data because on real problems we will always get more gains the more data we use. And because gradient descent is iterative and have to do that for many steps. So, that means that in-order to update the parameters in a single step, it has to go through all the data samples and then doing this iteration over the data tens or hundreds of times.

Instead of computing the loss over entire data samples for every step, we can compute the average loss for a very small random fraction of the training data. Think between 1 and 1000 training samples each time. This technique is called Stochastic Gradient Descent (SGD) and is at the core of deep learning. That’s because SGD scales well with both data and model size.

SGD gets its reputation for being black magic as it has lots of hyper-parameters to play and tune such as initialization parameters, learning rate parameters, decay, momentum, and you have to get them right.

Deep Learning has emerged over time with its evolution from neural networks to machine learning. It is an intriguing segment of machine learning that uses huge amount of data, to teach computers how to do things that only humans were capable of. It highlights some of the key players who have adopted this concept at the very early stage that are Facebook, Baidu, Amazon, Microsoft, and Google. It shows the different concept layers through which deep learning is executed.

*If Deep Learning has got you hooked, wait till you learn what GANs are from the book *** Learning Generative Adversarial Networks**.