*This article is an excerpt from the book **Neural Network Programming with Java Second Edition** by Fabio M. Soares and Alan M. F. Souza. The book is for Java developers who want to master developing smarter applications, such as weather forecasting and pattern recognition, using neural networks.*

In this article, we will discuss perceptrons along with their features, applications, and limitations.

Perceptrons are a very popular neural network architecture that implements supervised learning. Designed by Frank Rosenblatt in 1957, the perceptron has just one layer of neurons, which receives a set of inputs and produces a set of outputs. It was one of the first representations of neural networks to gain attention, largely because of its simplicity.

In our Java implementation, this is illustrated with one neural layer (the output layer). The following code creates a perceptron with three inputs and two outputs, having the linear function at the output layer:

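```java
// NeuralNet and Linear are classes from the book's own code base.
int numberOfInputs = 3;
int numberOfOutputs = 2;
// Linear activation function for the output layer
Linear outputAcFnc = new Linear(1.0);
// A perceptron: inputs connected directly to a single (output) layer
NeuralNet perceptron = new NeuralNet(numberOfInputs, numberOfOutputs,
        outputAcFnc);
```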

## Applications and limitations

However, scientists did not take long to conclude that a perceptron could only be applied to simple tasks, in keeping with that simplicity. At that time, neural networks were being used for simple classification problems, but perceptrons usually failed when faced with more complex datasets. Let’s illustrate this with a very basic example (an AND function) to better understand the issue.

### Linear separation

The example consists of an AND function that takes two inputs, x1 and x2. That function can be plotted in a two-dimensional chart as follows:

Now let’s examine how training evolves under the perceptron rule, starting with two weights, w1 and w2, both initially 0.5, and a bias that is also 0.5. Assume a learning rate η of 0.2:

| Epoch | x1 | x2 | w1 | w2 | b | y | t | E | Δw1 | Δw2 | Δb |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 0 | 0 | 0.5 | 0.5 | 0.5 | 0.5 | 0 | -0.5 | 0 | 0 | -0.1 |
| 1 | 0 | 1 | 0.5 | 0.5 | 0.4 | 0.9 | 0 | -0.9 | 0 | -0.18 | -0.18 |
| 1 | 1 | 0 | 0.5 | 0.32 | 0.22 | 0.72 | 0 | -0.72 | -0.144 | 0 | -0.144 |
| 1 | 1 | 1 | 0.356 | 0.32 | 0.076 | 0.752 | 1 | 0.248 | 0.0496 | 0.0496 | 0.0496 |
| 2 | 0 | 0 | 0.406 | 0.370 | 0.126 | 0.126 | 0 | -0.126 | 0.000 | 0.000 | -0.025 |
| 2 | 0 | 1 | 0.406 | 0.370 | 0.100 | 0.470 | 0 | -0.470 | 0.000 | -0.094 | -0.094 |
| 2 | 1 | 0 | 0.406 | 0.276 | 0.006 | 0.412 | 0 | -0.412 | -0.082 | 0.000 | -0.082 |
| 2 | 1 | 1 | 0.323 | 0.276 | -0.076 | 0.523 | 1 | 0.477 | 0.095 | 0.095 | 0.095 |
| … | … | … | … | … | … | … | … | … | … | … | … |
| 89 | 0 | 0 | 0.625 | 0.562 | -0.312 | -0.312 | 0 | 0.312 | 0 | 0 | 0.062 |
| 89 | 0 | 1 | 0.625 | 0.562 | -0.25 | 0.313 | 0 | -0.313 | 0 | -0.063 | -0.063 |
| 89 | 1 | 0 | 0.625 | 0.500 | -0.312 | 0.313 | 0 | -0.313 | -0.063 | 0 | -0.063 |
| 89 | 1 | 1 | 0.562 | 0.500 | -0.375 | 0.687 | 1 | 0.313 | 0.063 | 0.063 | 0.063 |
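The update behind each row of the table can be sketched in a few lines of Java. This is a minimal sketch, not the book's `NeuralNet` training code: it assumes a single linear neuron (y = w1·x1 + w2·x2 + b), the error E = t - y, and the updates Δw = η·E·x and Δb = η·E, which is what the numbers in the table follow. The class and method names are made up for illustration.

```java
class AndPerceptronSketch {

    // Trains one linear neuron sample by sample and returns {w1, w2, b}.
    static double[] train(double[][] x, double[] t, double eta, int epochs) {
        double w1 = 0.5, w2 = 0.5, b = 0.5;   // initial values used in the table
        for (int epoch = 1; epoch <= epochs; epoch++) {
            for (int n = 0; n < x.length; n++) {
                double y = w1 * x[n][0] + w2 * x[n][1] + b; // linear output
                double e = t[n] - y;                         // error E = t - y
                w1 += eta * e * x[n][0];                     // delta w1 = eta * E * x1
                w2 += eta * e * x[n][1];                     // delta w2 = eta * E * x2
                b += eta * e;                                // delta b  = eta * E
            }
        }
        return new double[] {w1, w2, b};
    }

    // Applies the 0.5 threshold to the neuron's linear output.
    static int classify(double[] p, double x1, double x2) {
        return (p[0] * x1 + p[1] * x2 + p[2]) > 0.5 ? 1 : 0;
    }

    public static void main(String[] args) {
        double[][] x = {{0, 0}, {0, 1}, {1, 0}, {1, 1}};
        double[] andTargets = {0, 0, 0, 1};
        double[] p = train(x, andTargets, 0.2, 89);
        System.out.printf("w1=%.3f w2=%.3f b=%.3f%n", p[0], p[1], p[2]);
        for (double[] in : x) {
            System.out.printf("%.0f AND %.0f -> %d%n",
                    in[0], in[1], classify(p, in[0], in[1]));
        }
    }
}
```

After one epoch, this sketch yields w1 ≈ 0.406, w2 ≈ 0.370 and b ≈ 0.126, the values the table shows at the start of epoch 2.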

After 89 epochs, the network produces values close to the desired output. Since the outputs in this example are binary (zero or one), we can treat any value produced by the network that is below 0.5 as 0 and any value above 0.5 as 1. So, with the final weights and bias found by the learning algorithm, w1=0.562, w2=0.5 and b=-0.375, the line w1·x1 + w2·x2 + b = 0.5 defines the linear boundary in the chart:

This boundary defines every classification the network makes. The boundary is linear because the function is linear. Thus, the perceptron is well suited to problems whose patterns are linearly separable.

### The XOR case

Now let’s analyze the XOR case:

We see that in two dimensions, it is impossible to draw a single line that separates the two patterns. What would happen if we tried to train a single-layer perceptron to learn this function? The following table shows the result:

| Epoch | x1 | x2 | w1 | w2 | b | y | t | E | Δw1 | Δw2 | Δb |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 0 | 0 | 0.5 | 0.5 | 0.5 | 0.5 | 0 | -0.5 | 0 | 0 | -0.1 |
| 1 | 0 | 1 | 0.5 | 0.5 | 0.4 | 0.9 | 1 | 0.1 | 0 | 0.02 | 0.02 |
| 1 | 1 | 0 | 0.5 | 0.52 | 0.42 | 0.92 | 1 | 0.08 | 0.016 | 0 | 0.016 |
| 1 | 1 | 1 | 0.516 | 0.52 | 0.436 | 1.472 | 0 | -1.472 | -0.294 | -0.294 | -0.294 |
| 2 | 0 | 0 | 0.222 | 0.226 | 0.142 | 0.142 | 0 | -0.142 | 0.000 | 0.000 | -0.028 |
| 2 | 0 | 1 | 0.222 | 0.226 | 0.113 | 0.339 | 1 | 0.661 | 0.000 | 0.132 | 0.132 |
| 2 | 1 | 0 | 0.222 | 0.358 | 0.246 | 0.467 | 1 | 0.533 | 0.107 | 0.000 | 0.107 |
| 2 | 1 | 1 | 0.328 | 0.358 | 0.352 | 1.038 | 0 | -1.038 | -0.208 | -0.208 | -0.208 |
| … | … | … | … | … | … | … | … | … | … | … | … |
| 127 | 0 | 0 | -0.250 | -0.125 | 0.625 | 0.625 | 0 | -0.625 | 0.000 | 0.000 | -0.125 |
| 127 | 0 | 1 | -0.250 | -0.125 | 0.500 | 0.375 | 1 | 0.625 | 0.000 | 0.125 | 0.125 |
| 127 | 1 | 0 | -0.250 | 0.000 | 0.625 | 0.375 | 1 | 0.625 | 0.125 | 0.000 | 0.125 |
| 127 | 1 | 1 | -0.125 | 0.000 | 0.750 | 0.625 | 0 | -0.625 | -0.125 | -0.125 | -0.125 |
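The behavior in the table can be reproduced with a short Java sketch that tracks the largest absolute error seen in the final epoch. As an illustration only, it assumes the same single linear neuron and update rule the table's numbers follow (E = t - y, Δw = η·E·x, Δb = η·E); the class and method names are hypothetical and not part of the book's code.

```java
class XorSingleLayerSketch {

    // Runs the online update rule on one linear neuron and returns the
    // largest absolute error |t - y| observed during the final epoch.
    static double largestErrorInLastEpoch(double[][] x, double[] t,
                                          double eta, int epochs) {
        double w1 = 0.5, w2 = 0.5, b = 0.5;   // same starting point as the table
        double worst = 0.0;
        for (int epoch = 1; epoch <= epochs; epoch++) {
            worst = 0.0;                      // only the last epoch is reported
            for (int n = 0; n < x.length; n++) {
                double y = w1 * x[n][0] + w2 * x[n][1] + b;
                double e = t[n] - y;
                worst = Math.max(worst, Math.abs(e));
                w1 += eta * e * x[n][0];
                w2 += eta * e * x[n][1];
                b += eta * e;
            }
        }
        return worst;
    }

    public static void main(String[] args) {
        double[][] x = {{0, 0}, {0, 1}, {1, 0}, {1, 1}};
        double[] xorTargets = {0, 1, 1, 0};
        for (int epochs : new int[] {127, 500, 2000}) {
            System.out.printf("epochs=%d largest |E|=%.3f%n", epochs,
                    largestErrorInLastEpoch(x, xorTargets, 0.2, epochs));
        }
    }
}
```

In this sketch, extra epochs do not help: the error magnitude settles at about 0.625, as in the last rows of the table.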

The perceptron could not find any set of weights that would drive the error magnitude below 0.625. This can be explained mathematically: as we already saw from the chart, this function is not linearly separable in two dimensions. So what if we add another dimension? Let’s see the chart in three dimensions:

In three dimensions, it is possible to draw a plane that separates the patterns, provided that the additional dimension properly transforms the input data. Okay, but now there is an additional problem: how could we derive this additional dimension, since we have only two input variables? One obvious answer, although something of a workaround, would be to add a third variable derived from the two original ones. With this third variable being a derivation, our neural network would probably take the following shape:

Okay, now the perceptron has three inputs, one of them being a composition of the other two. This also leads to a new question: how should that composition be processed? We can see that this component could act as a neuron, giving the neural network a nested architecture. If so, there would be another new question: how would the weights of this new neuron be trained, since the error is on the output neuron?
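To make the idea of a derived third input concrete, here is a minimal Java sketch. The text does not say which derivation to use, so the product x3 = x1·x2 is purely an assumption chosen for illustration, as are the class and method names. The neuron is still a single linear unit trained with the same online rule as before; only the input has gained a hand-crafted dimension.

```java
class XorDerivedInputSketch {

    // Builds the three-dimensional input; x3 = x1 * x2 is an assumed derivation.
    static double[] augment(double x1, double x2) {
        return new double[] {x1, x2, x1 * x2};
    }

    // Trains one linear neuron on (x1, x2, x3) and returns {w1, w2, w3, b}.
    static double[] train(double eta, int epochs) {
        double[][] x = {{0, 0}, {0, 1}, {1, 0}, {1, 1}};
        double[] t = {0, 1, 1, 0};            // XOR targets
        double[] w = {0.5, 0.5, 0.5};
        double b = 0.5;
        for (int epoch = 1; epoch <= epochs; epoch++) {
            for (int n = 0; n < x.length; n++) {
                double[] in = augment(x[n][0], x[n][1]);
                double y = b;
                for (int k = 0; k < w.length; k++) {
                    y += w[k] * in[k];
                }
                double e = t[n] - y;          // error on the output neuron
                for (int k = 0; k < w.length; k++) {
                    w[k] += eta * e * in[k];
                }
                b += eta * e;
            }
        }
        return new double[] {w[0], w[1], w[2], b};
    }

    // Thresholds the linear output at 0.5, as in the AND example.
    static int classify(double[] p, double x1, double x2) {
        double[] in = augment(x1, x2);
        double y = p[0] * in[0] + p[1] * in[1] + p[2] * in[2] + p[3];
        return y > 0.5 ? 1 : 0;
    }

    public static void main(String[] args) {
        double[] p = train(0.2, 5000);
        System.out.printf("w1=%.3f w2=%.3f w3=%.3f b=%.3f%n", p[0], p[1], p[2], p[3]);
        for (int x1 = 0; x1 <= 1; x1++) {
            for (int x2 = 0; x2 <= 1; x2++) {
                System.out.printf("%d XOR %d -> %d%n", x1, x2, classify(p, x1, x2));
            }
        }
    }
}
```

With this particular choice, XOR can be written exactly as x1 + x2 - 2·x1·x2, so a plane in the three-dimensional input space separates the patterns. The derivation itself is fixed by hand here, which leaves open the question of how such a component could be learned.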

## Multi-layer perceptrons

As we can see, one simple example in which the patterns are not linearly separable has led us to more and more issues with the perceptron architecture. That need led to the application of multi-layer perceptrons. It is already established that the natural neural network is structured in layers as well, with each layer capturing pieces of information from a specific environment. In artificial neural networks, layers of neurons act in a similar way, extracting and abstracting information from data and transforming it into another dimension or shape.

In the XOR example, we found the solution to be the addition of a third component that would make a linear separation possible. But a few questions remained about how that third component would be computed. Now let’s consider the same solution as a two-layer perceptron:

Now we have three neurons instead of just one, and at the output the information transferred by the previous layer has been transformed into another dimension or shape, in which it would be theoretically possible to establish a linear boundary on those data points. However, the question of how to find the weights for the first layer remains unanswered: can we apply the same training rule to neurons other than the output? We are going to deal with this issue in the Generalized delta rule section.
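A two-layer arrangement of three neurons that computes XOR can be sketched as follows. Everything specific in this sketch is an assumption made for illustration: the step activation, the hand-picked weights (one hidden neuron acting as OR, the other as NAND, and the output as AND), and the class name. It shows that suitable weights exist; it says nothing about how training would find them.

```java
class TwoLayerXorSketch {

    // Step activation: fires when the weighted sum is positive.
    static int step(double v) {
        return v > 0 ? 1 : 0;
    }

    // Two hidden neurons feed one output neuron; all weights are hand-picked.
    static int xor(int x1, int x2) {
        int h1 = step(1.0 * x1 + 1.0 * x2 - 0.5);   // hidden neuron 1: OR
        int h2 = step(-1.0 * x1 - 1.0 * x2 + 1.5);  // hidden neuron 2: NAND
        return step(1.0 * h1 + 1.0 * h2 - 1.5);     // output neuron: AND of h1, h2
    }

    public static void main(String[] args) {
        for (int x1 = 0; x1 <= 1; x1++) {
            for (int x2 = 0; x2 <= 1; x2++) {
                System.out.printf("%d XOR %d -> %d%n", x1, x2, xor(x1, x2));
            }
        }
    }
}
```

The hidden layer maps the four input points into a space where the output neuron only needs a single linear boundary.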

### MLP properties

Multi-layer perceptrons can have any number of layers and also any number of neurons in each layer. The activation functions may be different on any layer. An MLP network is usually composed of at least two layers, one for the output and one hidden layer.

Some references also count the input layer, that is, the nodes that collect input data, so in those cases the MLP is considered to have at least three layers. For the purpose of this article, let’s treat the input layer as a special type of layer that has no weights, and consider only the hidden and output layers to be effective layers, that is, layers that can be trained.

A hidden layer is so called because it hides its outputs from the external world. Hidden layers can be connected in series in any number, thus forming a deep neural network. However, the more layers a neural network has, the slower both training and running become. In theory, a neural network with one or two hidden layers and enough neurons can approximate the same functions as a deep neural network with dozens of hidden layers, although how well it learns in practice depends on several factors.

### MLP weights

In an MLP feedforward network, one particular neuron i receives data from a neuron j of the previous layer and forwards its output to a neuron k of the next layer:

The mathematical description of a neural network is recursive:
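A form consistent with the symbol definitions in the next paragraph can be sketched as below. The weight $w_{ij}$ (connecting neuron $j$ of the previous layer to neuron $i$) and the output bias $b_o$ are notational assumptions introduced here to close the recursion; the ellipsis stands for the same pattern repeated down to the first hidden layer.

```latex
y_o = f_o\left( \sum_{i=1}^{n_{h_l}} w_i \, f_i\left( \sum_{j=1}^{n_{h_{l-1}}} w_{ij} \, f_j\left( \cdots \right) + b_i \right) + b_o \right)
```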

Here, yo is the network output (should we have multiple outputs, we can replace yo with Y, representing a vector); fo is the activation function of the output; l is the number of hidden layers; nhi is the number of neurons in the hidden layer i; wi is the weight connecting the i-th neuron of the last hidden layer to the output; fi is the activation function of the neuron i; and bi is the bias of the neuron i. It can be seen that this equation gets larger as the number of layers increases. The innermost summation operates on the inputs xi.
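In code, the recursion unrolls into a loop over layers, where each layer applies its activation function to a weighted sum plus a bias. The following Java sketch shows one way to write that feedforward computation; the array layout (`w[l][i][j]` for the weight from input j to neuron i of layer l) and all names are assumptions for illustration and do not mirror the book's `NeuralNet` class.

```java
import java.util.function.DoubleUnaryOperator;

class FeedForwardSketch {

    // One layer: out[i] = f(sum_j w[i][j] * in[j] + b[i]).
    static double[] layer(double[] in, double[][] w, double[] b,
                          DoubleUnaryOperator f) {
        double[] out = new double[w.length];
        for (int i = 0; i < w.length; i++) {
            double sum = b[i];
            for (int j = 0; j < in.length; j++) {
                sum += w[i][j] * in[j];
            }
            out[i] = f.applyAsDouble(sum);
        }
        return out;
    }

    // Feeds the input through every layer in order (hidden layers, then output).
    static double[] forward(double[] x, double[][][] w, double[][] b,
                            DoubleUnaryOperator[] f) {
        double[] signal = x;
        for (int l = 0; l < w.length; l++) {
            signal = layer(signal, w[l], b[l], f[l]);
        }
        return signal;
    }

    static double sigmoid(double v) {
        return 1.0 / (1.0 + Math.exp(-v));
    }

    public static void main(String[] args) {
        // Two inputs, two sigmoid hidden neurons, one linear output (arbitrary weights).
        double[][][] w = {{{0.1, -0.2}, {0.4, 0.3}}, {{1.0, -1.0}}};
        double[][] b = {{0.0, 0.1}, {0.5}};
        DoubleUnaryOperator[] f = {FeedForwardSketch::sigmoid, v -> v};
        double[] y = forward(new double[] {1.0, 0.0}, w, b, f);
        System.out.println("output = " + y[0]);
    }
}
```

Because each layer has its own entry in `f`, the activation function can differ from layer to layer.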

### Recurrent MLP

The neurons in an MLP may feed signals not only to neurons in the next layers (feedforward network), but also to neurons in the same or previous layers (feedback or recurrent). This behavior allows the neural network to maintain state over a data sequence, a feature that is especially useful when dealing with time series or handwriting recognition. Recurrent networks are usually harder to train, and the computer may even run out of memory while executing them. In addition, there are recurrent network architectures better suited to these tasks than MLPs, such as Elman, Hopfield, Echo state, and Bidirectional RNNs (recurrent neural networks). But we are not going to dive deep into these architectures.
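The idea of maintaining state can be illustrated with a single neuron whose previous output is fed back as an extra input. This is a deliberately tiny sketch under stated assumptions: a linear activation, one input, one feedback weight, and hypothetical names. It is not an implementation of any of the architectures listed above.

```java
class RecurrentNeuronSketch {

    private final double inputWeight;
    private final double feedbackWeight;
    private double state = 0.0;   // previous output, fed back on the next step

    RecurrentNeuronSketch(double inputWeight, double feedbackWeight) {
        this.inputWeight = inputWeight;
        this.feedbackWeight = feedbackWeight;
    }

    // Computes y(t) = inputWeight * x(t) + feedbackWeight * y(t - 1).
    double step(double x) {
        state = inputWeight * x + feedbackWeight * state;
        return state;
    }

    public static void main(String[] args) {
        RecurrentNeuronSketch neuron = new RecurrentNeuronSketch(1.0, 0.5);
        double[] sequence = {1.0, 0.0, 0.0};
        for (double x : sequence) {
            System.out.println("x=" + x + " y=" + neuron.step(x));
        }
    }
}
```

Even after the input returns to zero, the output keeps reflecting the earlier input, which is the kind of memory a purely feedforward network does not have.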

### Coding an MLP

Bringing these concepts into the OOP point of view, we can review the classes designed so far:

One can see that the neural network structure is hierarchical. A neural network is composed of layers that are composed of neurons. In the MLP architecture, there are three types of layers: input, hidden, and output. So suppose that in Java, we would like to define a neural network consisting of three inputs, one output (linear activation function) and one hidden layer (sigmoid function) containing five neurons. The resulting code would be as follows:

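```java
// NeuralNet, Linear and Sigmoid are classes from the book's own code base.
int numberOfInputs = 3;
int numberOfOutputs = 1;
// One hidden layer with five neurons (one array entry per hidden layer)
int[] numberOfHiddenNeurons = {5};
// Linear activation for the output layer, sigmoid for the hidden layer
Linear outputAcFnc = new Linear(1.0);
Sigmoid hiddenAcFnc = new Sigmoid(1.0);
NeuralNet neuralnet = new NeuralNet(numberOfInputs, numberOfOutputs,
        numberOfHiddenNeurons, hiddenAcFnc, outputAcFnc);
```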
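To get a feel for the size of such a network, the number of trainable parameters can be counted from the layer sizes alone. The sketch below assumes that every neuron has one weight per input plus one bias, as in the tables earlier in this article; it is a standalone helper with a hypothetical name and makes no claim about how the book's `NeuralNet` class stores its weights.

```java
class ParameterCountSketch {

    // Counts weights and biases, assuming each neuron has one weight per
    // input it receives plus a single bias.
    static int countParameters(int inputs, int[] hiddenNeurons, int outputs) {
        int total = 0;
        int previous = inputs;               // size of the layer feeding the next one
        for (int neurons : hiddenNeurons) {
            total += (previous + 1) * neurons;
            previous = neurons;
        }
        total += (previous + 1) * outputs;   // output layer
        return total;
    }

    public static void main(String[] args) {
        // Three inputs, one hidden layer of five neurons, one output.
        System.out.println(countParameters(3, new int[] {5}, 1));
    }
}
```

For the network defined above, this gives (3 + 1)·5 + (5 + 1)·1 = 26 trainable values.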

To summarize, we saw how perceptrons can be applied to solve linearly separable problems, their limitations in classifying nonlinear data, and how to overcome those limitations with multi-layer perceptrons (MLPs).

*If you enjoyed this excerpt, check out the book **Neural Network Programming with Java Second Edition** for a better understanding of neural networks and how they fit in different real-world projects.*