In this article by Joshua F. Wiley, author of the book R Deep Learning Essentials, we will discuss deep learning, a powerful multilayered architecture for pattern recognition, signal detection, classification, and prediction. Although deep learning is not new, it has gained popularity in the past decade due to advances in computational capacity, new ways of training models efficiently, and the availability of ever-growing amounts of data. In this article, you will learn what deep learning is.
What is deep learning?
To understand what deep learning is, perhaps it is easiest to start with what is meant by regular machine learning. In general terms, machine learning is devoted to developing and using algorithms that learn from raw data in order to make predictions. Prediction is a very general term. For example, predictions from machine learning may include predicting how much money a customer will spend at a given company, or whether a particular credit card purchase is fraudulent. Predictions also encompass more general pattern recognition, such as what letters are present in a given image, or whether a picture is of a horse, dog, person, face, building, and so on. Deep learning is a branch of machine learning where a multi-layered (deep) architecture is used to map the relations between inputs or observed features and the outcome. This deep architecture makes deep learning particularly suitable for handling a large number of variables and allows deep learning to generate features as part of the overall learning algorithm, rather than feature creation being a separate step. Deep learning has proven particularly effective in the fields of image recognition (including handwriting as well as photo or object classification) and natural language processing, such as recognizing speech.
There are many types of machine learning algorithms. In this article, we are primarily going to focus on neural networks as these have been particularly popular in deep learning. However, this focus does not mean that it is the only technique available in machine learning or even deep learning, nor that other techniques are not valuable or even better suited, depending on the specific task.
Conceptual overview of neural networks
As their name suggests, neural networks draw their inspiration from neural processes and neurons in the body. Neural networks contain a series of neurons, or nodes, which are interconnected and process input. The connections between neurons are weighted, with these weights based on the function being used and learned from the data. Activation in one set of neurons and the weights (adaptively learned from the data) may then feed into other neurons, and the activation of some final neuron(s) is the prediction.
To make this process more concrete, an example from human visual perception may be helpful. The term grandmother cell is used to refer to the concept that somewhere in the brain there is a cell or neuron that responds specifically to a complex and specific object, such as your grandmother. Such specificity would require thousands of cells to represent every unique entity or object we encounter. Instead, it is thought that visual perception occurs by building up more basic pieces into complex representations. For example, the following is a picture of a square:
Rather than our visual system having cells or neurons that are activated only upon seeing the gestalt, or entirety, of a square, we can have cells that recognize horizontal and vertical lines, as shown in the following:
In this hypothetical case, there may be two neurons, one which is activated when it senses horizontal lines and another that is activated when it senses vertical lines. Finally, a higher-order process recognizes that it is seeing a square when both the lower order neurons are activated simultaneously.
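The hypothetical two-neuron case above can be sketched in a few lines of Python. This is only a toy illustration: the function names, the equal weights, and the thresholds are assumptions chosen for the example, not a model of real visual processing.

```python
def step(value, threshold):
    """Fire (return 1) when the input reaches the threshold, else stay silent (0)."""
    return 1 if value >= threshold else 0


def sees_square(has_horizontal, has_vertical):
    """Toy higher-order neuron that fires only when both detectors fire."""
    # Lower-order neurons: each responds to one basic feature.
    horizontal_neuron = step(has_horizontal, threshold=1)
    vertical_neuron = step(has_vertical, threshold=1)
    # Higher-order neuron: equal (assumed) weights, threshold of 2,
    # so a single active detector is not enough.
    weighted_sum = 1.0 * horizontal_neuron + 1.0 * vertical_neuron
    return step(weighted_sum, threshold=2)


for horizontal, vertical in [(0, 0), (1, 0), (0, 1), (1, 1)]:
    print(horizontal, vertical, sees_square(horizontal, vertical))
```

Only the last combination, with both horizontal and vertical lines present, activates the higher-order neuron.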
Neural networks share some of these same concepts, with inputs being processed by a first layer of neurons that may go on to trigger another layer. Neural networks are sometimes shown as graphical models. In Figure 3, the inputs are data, represented as squares. These may be pixels in an image, different aspects of sounds, or something else. The next layer consists of hidden neurons that recognize basic features, such as horizontal lines, vertical lines, or curved lines. Finally, the output may be a neuron that is activated by the simultaneous activation of two of the hidden neurons. In this article, observed data or features are depicted as squares, and unobserved or hidden layers as circles:
The term neural network refers to a broad class of models and algorithms. Hidden neurons are generated based on some combination of the observed data, similar to a basis expansion in other statistical techniques; however, rather than choosing the form of the expansion, the weights used to create the hidden neurons are learned from the data. Neural networks can involve a variety of activation functions, which are transformations of the weighted raw data inputs to create the hidden neurons. Common choices for activation functions are the sigmoid function, $\sigma(x) = \frac{1}{1 + e^{-x}}$, and the hyperbolic tangent function, $\tanh(x) = \frac{e^{x} - e^{-x}}{e^{x} + e^{-x}}$. Finally, radial basis functions are sometimes used as they are efficient function approximators. Although there are a variety of these, the Gaussian form is common: $f(x) = \exp\left(-\frac{\lVert x - c \rVert^{2}}{2 s^{2}}\right)$, where $c$ is the center and $s$ the width of the basis function.
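As a rough illustration, the activation functions named above can be written in Python using their standard textbook definitions. The function names, the `center` and `width` parameters of the Gaussian, and the `hidden_activation` helper are assumptions made for this sketch.

```python
import math


def sigmoid(x):
    """Logistic sigmoid: squashes any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def hyperbolic_tangent(x):
    """Hyperbolic tangent: squashes any real number into (-1, 1)."""
    return math.tanh(x)


def gaussian_rbf(x, center, width):
    """Gaussian radial basis function of the distance between x and center."""
    squared_distance = sum((xi - ci) ** 2 for xi, ci in zip(x, center))
    return math.exp(-squared_distance / (2.0 * width ** 2))


def hidden_activation(inputs, weights, bias, activation):
    """One hidden neuron: an activation applied to the weighted sum of inputs."""
    weighted_sum = bias + sum(w * x for w, x in zip(weights, inputs))
    return activation(weighted_sum)


# Hypothetical inputs and weights; in practice the weights are learned.
print(hidden_activation([1.0, 2.0], [0.5, -0.25], 0.0, sigmoid))
print(hidden_activation([1.0, 2.0], [0.5, -0.25], 0.0, hyperbolic_tangent))
```

In a real network the weights and bias would be learned from the data rather than fixed by hand as they are here.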
In a shallow neural network such as the one shown in Figure 3, with only a single hidden layer, going from the hidden units to the outputs is essentially a standard regression or classification problem. The hidden units can be denoted by h and the outputs by Y. Different outputs can be denoted by subscripts i = 1, …, k and may represent different possible classifications, such as (in our case) a circle or square. The paths from each hidden unit to each output are the weights, and for the ith output they are denoted by wi. These weights are also learned from the data, just like the weights used to create the hidden layer. For classification, it is common to use a final transformation, the softmax function, $P(Y = i \mid h) = \frac{\exp(w_i^{\top} h)}{\sum_{j=1}^{k} \exp(w_j^{\top} h)}$, as this ensures that the estimates are positive (using the exponential function) and that the probabilities of being in each class sum to one. For linear regression, the identity function, which returns its input, is commonly used. Confusion may arise as to why there are paths between every hidden unit and output as well as every input and hidden unit. These are commonly drawn to represent that, a priori, any of these relations are allowed to exist. The weights must then be learned from the data, with zero or near-zero weights essentially equating to dropping unnecessary relations.
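The step from hidden units to class probabilities can be sketched as follows, using the standard definition of the softmax function. The hidden values and weights below are made-up numbers for illustration; in a real network both would come from training.

```python
import math


def softmax(scores):
    """Turn arbitrary real scores into positive values that sum to one."""
    largest = max(scores)  # subtracting the maximum avoids overflow in exp()
    exps = [math.exp(score - largest) for score in scores]
    total = sum(exps)
    return [value / total for value in exps]


def output_probabilities(hidden, weights):
    """weights[i] holds w_i, the weights from every hidden unit to output i."""
    scores = [sum(w * h for w, h in zip(w_i, hidden)) for w_i in weights]
    return softmax(scores)


# Two hidden units and two classes (say, circle and square); values are assumed.
hidden = [1.0, 0.0]
weights = [[2.0, -1.0],  # w_1: weights into the first output
           [0.5, 0.3]]   # w_2: weights into the second output
probabilities = output_probabilities(hidden, weights)
print(probabilities, sum(probabilities))
```

Because the first output receives the larger weighted score, it also receives the larger probability.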
This only scratches the surface of the conceptual and practical aspects of neural networks. For a slightly more in-depth introduction to neural networks, see Chapter 11 of The Elements of Statistical Learning by Trevor Hastie, Robert Tibshirani, and Jerome Friedman (2009), also freely available at http://statweb.stanford.edu/~tibs/ElemStatLearn/. Next, we will turn to a brief introduction to deep neural networks.
Deep neural networks
Perhaps the simplest, if not the most informative, definition of a deep neural network (DNN) is that it is a neural network with multiple hidden layers. Although a relatively simple conceptual extension of neural networks, such deep architecture provides valuable advances in terms of the capability of the models and new challenges in training them.
Using multiple hidden layers allows a more sophisticated build-up from simple elements to more complex ones. When discussing neural networks, we considered the outputs to be whether the object was a circle or a square. In a deep neural network, many circles and squares could be combined to form other more advanced shapes. One can consider two complexity aspects of a model's architecture. One is how wide or narrow it is, that is, how many neurons are in a given layer. The second is how deep it is, or how many layers of neurons there are. For data that truly has such a deep structure, a DNN can fit it more accurately with fewer parameters than a shallow neural network (NN), because more layers (each with fewer neurons) can be a more efficient and accurate representation. For example, because the shallow NN cannot build more advanced shapes from basic pieces, it must represent each unique object separately in order to match the accuracy of the DNN.
Again considering pattern recognition in images, if we are trying to train a model for text recognition, the raw data may be pixels from an image. The first layer of neurons could be trained to capture different letters of the alphabet, and then another layer could recognize sets of these letters as words. The advantage is that the second layer does not have to directly learn from the pixels, which are noisy and complex. In contrast, a shallow architecture may require far more parameters, as each hidden neuron would have to be capable of going directly from pixels in an image to a complete word, and many words may overlap, creating redundancy in the model.
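To make the width versus depth trade-off concrete, the following sketch counts the weights and biases in two fully connected networks. The layer sizes are hypothetical, and the comparison only counts parameters; it says nothing about whether the two networks would reach the same accuracy on a given problem.

```python
def count_parameters(layer_sizes):
    """Count weights and biases in a fully connected feed-forward network.

    layer_sizes lists the number of units per layer, inputs first, outputs last.
    """
    # One weight for every connection between consecutive layers.
    weights = sum(a * b for a, b in zip(layer_sizes, layer_sizes[1:]))
    # One bias for every non-input unit.
    biases = sum(layer_sizes[1:])
    return weights + biases


# A wide, shallow network: 100 inputs, one hidden layer of 1,000 units, 10 outputs.
shallow = count_parameters([100, 1000, 10])
# A narrower, deeper network: two hidden layers of 30 units each.
deep = count_parameters([100, 30, 30, 10])
print(shallow, deep)  # prints: 111010 4270
```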
One of the challenges in training deep neural networks is how to efficiently learn the weights. The models are often complex and local minima abound, making the optimization problem a challenging one. One of the major advancements came in 2006, when it was shown that Deep Belief Networks (DBNs) could be trained one layer at a time (refer to A Fast Learning Algorithm for Deep Belief Nets, by Geoffrey E. Hinton, Simon Osindero, and Yee-Whye Teh (2006), at http://www.cs.toronto.edu/~fritz/absps/ncfast.pdf). A DBN is a type of DNN that has multiple hidden layers and connections between (but not within) layers (that is, a neuron in layer 1 may be connected to a neuron in layer 2, but may not be connected to another neuron in layer 1). This is essentially the same as the definition of a Restricted Boltzmann Machine (RBM), an example of which is diagrammed in Figure 4, except that an RBM typically has one input layer and one hidden layer:
The restriction of no connections within a layer is valuable as it allows for much faster training algorithms to be used, such as the contrastive divergence algorithm. If several RBMs are stacked together, they can form a DBN. Essentially, the DBN can then be trained as a series of RBMs. The first RBM layer is trained and used to transform raw data into hidden neurons, which are then treated as a new set of inputs in a second RBM, and the process is repeated until all layers have been trained.
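The layer-by-layer procedure described above can be sketched in plain Python. This is a minimal illustration under several assumptions: binary inputs, a single step of contrastive divergence (often called CD-1) per example, and a hand-picked learning rate, number of epochs, and toy dataset. The class and function names are hypothetical, and a practical implementation would use a matrix library and mini-batches.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def sample(probs, rng):
    """Draw a binary state for each unit from its activation probability."""
    return [1.0 if rng.random() < p else 0.0 for p in probs]


class RBM:
    def __init__(self, n_visible, n_hidden, rng):
        self.rng = rng
        # Weights only connect visible to hidden units (none within a layer).
        self.W = [[rng.gauss(0.0, 0.1) for _ in range(n_hidden)]
                  for _ in range(n_visible)]
        self.b_visible = [0.0] * n_visible
        self.b_hidden = [0.0] * n_hidden

    def hidden_probs(self, v):
        return [sigmoid(self.b_hidden[j]
                        + sum(v[i] * self.W[i][j] for i in range(len(v))))
                for j in range(len(self.b_hidden))]

    def visible_probs(self, h):
        return [sigmoid(self.b_visible[i]
                        + sum(h[j] * self.W[i][j] for j in range(len(h))))
                for i in range(len(self.b_visible))]

    def cd1_update(self, v0, lr):
        """One contrastive divergence step: data phase minus reconstruction phase."""
        h0 = self.hidden_probs(v0)
        v1 = self.visible_probs(sample(h0, self.rng))  # reconstruction
        h1 = self.hidden_probs(v1)
        for i in range(len(v0)):
            for j in range(len(h0)):
                self.W[i][j] += lr * (v0[i] * h0[j] - v1[i] * h1[j])
        for i in range(len(v0)):
            self.b_visible[i] += lr * (v0[i] - v1[i])
        for j in range(len(h0)):
            self.b_hidden[j] += lr * (h0[j] - h1[j])


def train_dbn(data, hidden_sizes, epochs=20, lr=0.1, seed=0):
    """Greedily train a stack of RBMs, one layer at a time."""
    rng = random.Random(seed)
    layers, inputs = [], data
    for n_hidden in hidden_sizes:
        rbm = RBM(len(inputs[0]), n_hidden, rng)
        for _ in range(epochs):
            for v in inputs:
                rbm.cd1_update(v, lr)
        layers.append(rbm)
        # The hidden activations become the inputs of the next RBM.
        inputs = [rbm.hidden_probs(v) for v in inputs]
    return layers, inputs


# Toy binary data with two repeating patterns (made up for illustration).
data = [[1.0, 1.0, 0.0, 0.0], [0.0, 0.0, 1.0, 1.0]] * 4
layers, features = train_dbn(data, hidden_sizes=[3, 2])
print(len(layers), len(features), len(features[0]))
```

Each call to `hidden_probs` plays the role of transforming the current inputs into hidden neurons, which are then handed to the next RBM in the stack.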
The benefits of the realization that DBNs could be trained one layer at a time extend beyond just DBNs, however. DBNs are sometimes used as a pre-training stage for a deep neural network. This allows the comparatively fast, greedy layer-by-layer training to be used to provide good initial estimates, which are then refined in the deep neural network using other, slower, training algorithms such as back propagation.
So far we have been primarily focused on feed-forward neural networks, where the results from one layer and neuron feed forward to the next. Before closing this section, two specific kinds of deep neural networks that have grown in popularity are worth mentioning. The first is a Recurrent Neural Network (RNN), where neurons send feedback signals to each other. These feedback loops allow RNNs to work well with sequences. A recent example of an application of RNNs was to automatically generate click-bait such as "One trick to great hair salons don't want you to know" or "Top 10 reasons to visit Los Angeles: #6 will shock you!". RNNs work well for such jobs as they can be seeded with a few words drawn from a large initial pool (even just trending search terms or names) and then predict or generate what the next word should be. This process can be repeated a few times until a short phrase, the click-bait, is generated. This example is drawn from a blog post by Lars Eidnes, available at http://larseidnes.com/2015/10/13/auto-generating-clickbait-with-recurrent-neural-networks/.
The second type is a Convolutional Neural Network (CNN). CNNs are most commonly used in image recognition. CNNs work by having each neuron respond to overlapping subregions of an image. The benefits of CNNs are that they require comparatively little pre-processing and yet, thanks to weight sharing (for example, across subregions of an image), do not require too many parameters. This is particularly valuable for images as they are often not consistent. For example, imagine ten different people taking a picture of the same desk. Some may be closer or farther away, or at positions that result in essentially the same image having different heights, widths, and amounts of background captured around the focal object.
As with neural networks, this description only provides the briefest of overviews as to what DNNs are and some of the use cases to which they can be applied.
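The idea of weight sharing across overlapping subregions can be illustrated with a small sketch. The same four kernel weights are reused at every position of a toy image, so the number of parameters does not grow with the image size. The image, the kernel values, and the function name are assumptions for this example, and the operation is written without flipping the kernel, which is how most CNN libraries implement it.

```python
def convolve2d(image, kernel):
    """Slide one shared kernel over every overlapping subregion of the image."""
    k_rows, k_cols = len(kernel), len(kernel[0])
    out_rows = len(image) - k_rows + 1
    out_cols = len(image[0]) - k_cols + 1
    output = []
    for r in range(out_rows):
        row = []
        for c in range(out_cols):
            # Each output neuron sees only a small patch of the image,
            # and all output neurons use the same kernel weights.
            total = 0.0
            for i in range(k_rows):
                for j in range(k_cols):
                    total += image[r + i][c + j] * kernel[i][j]
            row.append(total)
        output.append(row)
    return output


# Toy 4x4 image: dark left half, bright right half.
image = [[0, 0, 1, 1]] * 4
# A 2x2 kernel (assumed values) that responds to a dark-to-bright vertical edge.
vertical_edge = [[-1, 1],
                 [-1, 1]]
feature_map = convolve2d(image, vertical_edge)
print(feature_map)  # prints: [[0.0, 2.0, 0.0], [0.0, 2.0, 0.0], [0.0, 2.0, 0.0]]
```

The feature map is largest exactly where the edge sits, regardless of which row it appears in, using only four shared weights.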
This article presented a brief introduction to NNs and DNNs. By using multiple hidden layers, DNNs have revolutionized machine learning, providing a powerful unsupervised learning and feature-extraction component that can be standalone or integrated as part of a supervised model.
There are many applications of such models, and they are increasingly used by large companies such as Google, Microsoft, and Facebook. Examples of tasks for deep learning are image recognition (for example, automatically tagging faces or identifying keywords for an image), voice recognition, and text translation (for example, to go from English to Spanish, or vice versa). Work is also being done on text analysis, such as sentiment analysis, which tries to identify whether a sentence or paragraph is generally positive or negative and is particularly useful for evaluating perceptions of a product or service. Imagine being able to scrape reviews and social media for any mention of your product and analyze whether it was being discussed more or less favorably than in the previous month or year!