Deep feedforward networks, also called feedforward neural networks, are sometimes also referred to as **Multilayer Perceptrons** (**MLPs**). The goal of a feedforward network is to approximate some function *f∗*. For example, for a classifier, *y = f∗(x)* maps an input *x* to a label *y*. A feedforward network defines a mapping *y = f(x; θ)* from input to label and learns the value of the parameters *θ* that results in the best function approximation.

This tutorial is an excerpt from the book Neural Network Programming with TensorFlow by Manpreet Singh Ghotra and Rajdeep Dua. With this book, learn how to implement more advanced neural networks like CNNs, RNNs, GANs, deep belief networks, and others in TensorFlow.

## How do feedforward networks work?

Feedforward networks are a conceptual stepping stone on the path to recurrent networks, which power many natural language applications. Feedforward neural networks are called networks because they are typically represented by composing together many different functions.

The model is associated with a directed acyclic graph describing how the functions are composed together. For example, three functions *f(1)*, *f(2)*, and *f(3)* might be connected in a chain to form *f(x) = f(3)(f(2)(f(1)(x)))*. These chain structures are the most commonly used structures of neural networks. In this case, *f(1)* is called the **first layer** of the network, *f(2)* is called the **second layer**, and so on. The overall length of the chain gives the depth of the model. It is from this terminology that the name deep learning arises. The final layer of a feedforward network is called the **output layer**.

Diagram showing various functions activated on input x to form a neural network
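The chain structure described above can be sketched in a few lines of plain Python. The three toy functions below are hypothetical stand-ins for learned layers, and `compose` is a helper name introduced only for this illustration:

```python
from functools import reduce

def compose(*layers):
    """Chain layers left to right: compose(f1, f2, f3)(x) == f3(f2(f1(x)))."""
    return lambda x: reduce(lambda value, layer: layer(value), layers, x)

# Toy "layers" (illustrative only, not learned functions)
def f1(x):
    return 2 * x      # first layer

def f2(x):
    return x + 1      # second layer

def f3(x):
    return x ** 2     # output layer

network = compose(f1, f2, f3)
depth = 3  # the length of the chain is the depth of the model

print(network(3))  # f3(f2(f1(3))) = (2 * 3 + 1) ** 2 = 49
```

Adding another function to the call to `compose` makes the chain, and therefore the model, one layer deeper.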

These networks are called neural because they are inspired by neuroscience. Each hidden layer is a vector. The dimensionality of these hidden layers determines the width of the model.

**Implementing feedforward networks with TensorFlow**

Feedforward networks can be easily implemented using TensorFlow by defining placeholders for the inputs and labels, creating weights for the hidden layer, computing the activation values, and using them to calculate predictions. Let’s take an example of classification with a feedforward network:

```
import tensorflow as tf  # TensorFlow 1.x API

X = tf.placeholder("float", shape=[None, x_size])
y = tf.placeholder("float", shape=[None, y_size])
weights_1 = initialize_weights((x_size, hidden_size), stddev)
weights_2 = initialize_weights((hidden_size, y_size), stddev)
sigmoid = tf.nn.sigmoid(tf.matmul(X, weights_1))
# Named y_pred so that the label placeholder y is not overwritten
y_pred = tf.matmul(sigmoid, weights_2)
```

Once the predicted value tensor has been defined, we calculate the `cost` function:

```
# OPERATION_NAME stands for one of the functions listed below
cost = tf.reduce_mean(tf.nn.OPERATION_NAME(labels=..., logits=...))
updates_sgd = tf.train.GradientDescentOptimizer(sgd_step).minimize(cost)
```

Here, `OPERATION_NAME` could be one of the following:

`tf.nn.sigmoid_cross_entropy_with_logits`: Calculates sigmoid cross entropy on the incoming `logits` and `labels`:

```
sigmoid_cross_entropy_with_logits(
    _sentinel=None,
    labels=None,
    logits=None,
    name=None
)
```

`_sentinel`: Used to prevent positional parameters. Internal, do not use.

`labels`: A tensor of the same type and shape as `logits`.

`logits`: A tensor of type `float32` or `float64`. The formula implemented is `max(x, 0) - x * z + log(1 + exp(-abs(x)))`, where *x = logits* and *z = labels*.
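To see why the formula is written this way, here is a minimal plain-Python sketch (not TensorFlow's implementation) comparing it with the textbook definition `-z * log(sigmoid(x)) - (1 - z) * log(1 - sigmoid(x))`. The function names are introduced only for this illustration:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def naive_sigmoid_cross_entropy(x, z):
    """Textbook definition; breaks down for large |x| (overflow or log(0))."""
    p = sigmoid(x)
    return -z * math.log(p) - (1.0 - z) * math.log(1.0 - p)

def stable_sigmoid_cross_entropy(x, z):
    """The rearranged formula: max(x, 0) - x * z + log(1 + exp(-abs(x)))."""
    return max(x, 0.0) - x * z + math.log1p(math.exp(-abs(x)))

# For moderate logits both forms agree
print(naive_sigmoid_cross_entropy(2.0, 1.0))
print(stable_sigmoid_cross_entropy(2.0, 1.0))

# For extreme logits only the rearranged form can be evaluated
print(stable_sigmoid_cross_entropy(-1000.0, 1.0))
```

The rearranged form only ever exponentiates a non-positive number, so it stays finite where the textbook form does not.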

`tf.nn.softmax`: Performs `softmax` activation on the incoming tensor. This only normalizes the values so that the probabilities in each tensor row add up to one. It is an activation rather than a loss, so it cannot be used directly as the cost in a classification.

```
softmax = exp(logits) / reduce_sum(exp(logits), dim)
```

`logits`: A non-empty tensor. Must be one of the following types: `half`, `float32`, or `float64`.

`dim`: The dimension `softmax` will be performed on. The default is `-1`, which indicates the last dimension.

`name`: A name for the operation (optional).
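As a rough illustration of the normalization, the following plain-Python sketch applies the same formula to a single row of logits. Subtracting the row maximum before exponentiating is a common stability trick added here as an assumption; it does not change the result:

```python
import math

def softmax(logits):
    """exp(logits) / sum(exp(logits)) for one row, shifted for stability."""
    shift = max(logits)
    exps = [math.exp(v - shift) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

probs = softmax([2.0, 1.0, 0.1])
print(probs)       # larger logits get larger probabilities
print(sum(probs))  # the row adds up to one (within rounding)
```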

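`tf.nn.log_softmax`: Calculates the log of the `softmax` function as `logits - log(reduce_sum(exp(logits), dim))`, which helps avoid numerical underflow compared with taking the log of the `softmax` output. This function is also just a normalization function.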

```
log_softmax(
    logits,
    dim=-1,
    name=None
)
```

`logits`: A non-empty tensor. Must be one of the following types: `half`, `float32`, or `float64`.

`dim`: The dimension `softmax` will be performed on. The default is `-1`, which indicates the last dimension.

`name`: A name for the operation (optional).
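The benefit of computing the log directly shows up with widely separated logits. This is a minimal plain-Python sketch for one row, assuming the usual log-sum-exp formulation; it is not TensorFlow's code:

```python
import math

def log_softmax(logits):
    """logits - log(sum(exp(logits))) for one row, via a shifted log-sum-exp."""
    shift = max(logits)
    log_sum_exp = shift + math.log(sum(math.exp(v - shift) for v in logits))
    return [v - log_sum_exp for v in logits]

# exp(-800) underflows to 0.0, so log(softmax(...)) would hit log(0),
# while the direct computation keeps the value
print(log_softmax([0.0, -800.0]))

# Exponentiating the result recovers probabilities that add up to one
print(sum(math.exp(v) for v in log_softmax([1.0, 2.0, 3.0])))
```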

`tf.nn.softmax_cross_entropy_with_logits`

```
softmax_cross_entropy_with_logits(
    _sentinel=None,
    labels=None,
    logits=None,
    dim=-1,
    name=None
)
```

`_sentinel`: Used to prevent positional parameters. For internal use only.

`labels`: Each row `labels[i]` must be a valid probability distribution.

`logits`: Unscaled log probabilities.

`dim`: The class dimension. Defaults to `-1`, which is the last dimension.

`name`: A name for the operation (optional).

This function computes `softmax` cross entropy between `logits` and `labels`. While the classes are mutually exclusive, their probabilities need not be. All that is required is that each row of `labels` is a valid probability distribution. For exclusive labels (where one and only one class is true at a time), use `sparse_softmax_cross_entropy_with_logits`.
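A small plain-Python sketch may help make the "valid probability distribution" requirement concrete. It computes the loss for one row as the negative sum of `labels * log_softmax(logits)`; the helper names are introduced for this example and are not TensorFlow functions:

```python
import math

def log_softmax(logits):
    shift = max(logits)
    log_sum_exp = shift + math.log(sum(math.exp(v - shift) for v in logits))
    return [v - log_sum_exp for v in logits]

def softmax_cross_entropy(labels, logits):
    """Cross entropy between a label distribution and softmax(logits), one row."""
    return -sum(p * lp for p, lp in zip(labels, log_softmax(logits)))

logits = [1.0, 2.0, 3.0]
hard_loss = softmax_cross_entropy([0.0, 1.0, 0.0], logits)  # one-hot label
soft_loss = softmax_cross_entropy([0.1, 0.8, 0.1], logits)  # soft label
print(hard_loss, soft_loss)
```

Both label rows sum to one, so both are accepted; only the first is a one-hot (exclusive) label.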

`tf.nn.sparse_softmax_cross_entropy_with_logits`

```
sparse_softmax_cross_entropy_with_logits(
    _sentinel=None,
    labels=None,
    logits=None,
    name=None
)
```

`labels`: Tensor of shape [*d_0*, *d_1*, *…*, *d_(r-1)*] (where *r* is the rank of labels and result) and dtype `int32` or `int64`. Each entry in labels must be an index in [*0*, `num_classes`). Other values will raise an exception when this operation is run on the CPU and return NaN for corresponding loss and gradient rows on the GPU.

`logits`: Unscaled log probabilities of shape [*d_0*, *d_1*, *…*, *d_(r-1)*, `num_classes`] and dtype `float32` or `float64`.

This function computes sparse `softmax` cross entropy between `logits` and `labels`. The probability of a given label is considered exclusive. Soft classes are not allowed, and the labels vector must provide a single specific index for the true class for each row of `logits`.
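The relationship between the sparse and dense variants can be sketched in plain Python for a single row. The function names below are illustrative; raising `ValueError` for an out-of-range index is an assumption of this sketch, chosen to mirror the CPU behavior described above:

```python
import math

def log_softmax(logits):
    shift = max(logits)
    log_sum_exp = shift + math.log(sum(math.exp(v - shift) for v in logits))
    return [v - log_sum_exp for v in logits]

def dense_cross_entropy(one_hot_label, logits):
    return -sum(p * lp for p, lp in zip(one_hot_label, log_softmax(logits)))

def sparse_cross_entropy(label_index, logits):
    """Takes the class index itself instead of a one-hot row."""
    if not 0 <= label_index < len(logits):
        raise ValueError("label index must be in [0, num_classes)")
    return -log_softmax(logits)[label_index]

logits = [0.5, 2.5, -1.0]
print(sparse_cross_entropy(1, logits))
print(dense_cross_entropy([0.0, 1.0, 0.0], logits))  # same value
```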

`tf.nn.weighted_cross_entropy_with_logits`

```
weighted_cross_entropy_with_logits(
    targets,
    logits,
    pos_weight,
    name=None
)
```

`targets`: A tensor of the same type and shape as logits.

`logits`: A tensor of type `float32` or `float64`.

`pos_weight`: A coefficient to use on the positive examples.

This is similar to `sigmoid_cross_entropy_with_logits()` except that `pos_weight` allows a trade-off of recall and precision by up or down-weighting the cost of a positive error relative to a negative error.
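The effect of `pos_weight` can be sketched for a single logit in plain Python. The loss below is `pos_weight * z * -log(sigmoid(x)) + (1 - z) * -log(1 - sigmoid(x))`, written with a stable softplus helper; the function names are assumptions of this example:

```python
import math

def softplus(x):
    """log(1 + exp(x)), computed without overflow."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))

def weighted_cross_entropy(target, logit, pos_weight):
    # -log(sigmoid(x)) == softplus(-x) and -log(1 - sigmoid(x)) == softplus(x)
    return pos_weight * target * softplus(-logit) + (1.0 - target) * softplus(logit)

logit = -0.5  # the model leans towards the negative class
print(weighted_cross_entropy(1.0, logit, pos_weight=1.0))  # plain sigmoid cross entropy
print(weighted_cross_entropy(1.0, logit, pos_weight=2.0))  # missed positive costs twice as much
print(weighted_cross_entropy(0.0, logit, pos_weight=2.0))  # negatives are unaffected
```

A `pos_weight` above 1 penalizes false negatives more heavily, which tends to raise recall; a value below 1 tends to raise precision.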

**Analyzing the Iris dataset with a TensorFlow feedforward network**

Let’s look at a feedforward example using the Iris dataset.

You can download the dataset from https://github.com/ml-resources/neuralnetwork-programming/blob/ed1/ch02/iris/iris.csv and the target labels from https://github.com/ml-resources/neuralnetwork-programming/blob/ed1/ch02/iris/target.csv.

In the Iris dataset, we will use 150 rows of data made up of 50 samples from each of three Iris species: Iris setosa, Iris virginica, and Iris versicolor.

Petal geometry compared across three Iris species: **Iris setosa**, **Iris virginica**, and **Iris versicolor**.

In the dataset, each row contains the data for one flower sample: sepal length, sepal width, petal length, petal width, and flower species. Flower species are stored as integers, with 0 denoting Iris setosa, 1 denoting Iris versicolor, and 2 denoting Iris virginica.

First, we will create a `run()` function that takes three parameters: the hidden layer size `h_size`, the standard deviation for the weights `stddev`, and the step size of stochastic gradient descent `sgd_step`:

```
def run(h_size, stddev, sgd_step):
    ...
```

Input data loading is done using the `genfromtxt` function in `numpy`. The Iris data loaded has a shape of L: 150 and W: 4. After a column of ones is prepended for the bias, the data is stored in the `all_X` variable with a shape of L: 150, W: 5. Target labels are loaded from `target.csv` and one-hot encoded into `all_y` with a shape of L: 150, W: 3:

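```
import numpy as np
from numpy import genfromtxt
# train_test_split is assumed to be scikit-learn's helper
from sklearn.model_selection import train_test_split

def load_iris_data():
    # RANDOMSEED is expected to be a module-level integer constant
    data = genfromtxt('iris.csv', delimiter=',')
    target = genfromtxt('target.csv', delimiter=',').astype(int)
    # Prepend the column of 1s for bias
    L, W = data.shape
    all_X = np.ones((L, W + 1))
    all_X[:, 1:] = data
    # One-hot encode the integer species labels
    num_labels = len(np.unique(target))
    all_y = np.eye(num_labels)[target]
    return train_test_split(all_X, all_y, test_size=0.33, random_state=RANDOMSEED)
```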
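The two preprocessing steps in the loader, prepending a bias column and one-hot encoding the targets, can be sketched without NumPy. The helper names and the two sample rows below are illustrative assumptions, not part of the original script:

```python
def add_bias_column(rows):
    """Prepend a constant 1.0 to every feature row."""
    return [[1.0] + list(row) for row in rows]

def one_hot(targets, num_labels):
    """Turn integer class labels into one-hot rows."""
    return [[1.0 if j == t else 0.0 for j in range(num_labels)] for t in targets]

# Two illustrative measurement rows: sepal length, sepal width, petal length, petal width
features = [[5.1, 3.5, 1.4, 0.2], [7.0, 3.2, 4.7, 1.4]]
targets = [0, 1]

all_X = add_bias_column(features)
all_y = one_hot(targets, num_labels=3)
print(all_X)  # each row now has 5 values: the bias plus 4 features
print(all_y)  # each row has 3 values, one per species
```

This is why the input layer has five nodes and the output layer has three.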

Once data is loaded, we initialize the weights matrix based on `x_size`, `y_size`, and `h_size` with standard deviation passed to the `run()` method:

- `x_size` = 5
- `y_size` = 3
- `h_size` = 128 (or any other number chosen for neurons in the hidden layer)

```
# Size of layers
x_size = train_x.shape[1]  # Input nodes: 4 features and 1 bias
y_size = train_y.shape[1]  # Outcomes (3 iris flowers)

# Variables
X = tf.placeholder("float", shape=[None, x_size])
y = tf.placeholder("float", shape=[None, y_size])
weights_1 = initialize_weights((x_size, h_size), stddev)
weights_2 = initialize_weights((h_size, y_size), stddev)
```
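The `initialize_weights()` helper is not shown in this excerpt. Judging by its arguments, a plausible reading is that it creates a weight matrix of the given shape with values drawn from a zero-mean normal distribution with standard deviation `stddev`. The plain-Python stand-in below only illustrates that idea; its name and its `seed` parameter are assumptions, and it is not the book's implementation:

```python
import random

def initialize_weights_sketch(shape, stddev, seed=None):
    """Return a rows x cols matrix of samples from N(0, stddev ** 2)."""
    rng = random.Random(seed)
    rows, cols = shape
    return [[rng.gauss(0.0, stddev) for _ in range(cols)] for _ in range(rows)]

# Shapes matching the example: 5 inputs -> 128 hidden units -> 3 outputs
weights_1 = initialize_weights_sketch((5, 128), stddev=0.1, seed=0)
weights_2 = initialize_weights_sketch((128, 3), stddev=0.1, seed=1)
print(len(weights_1), len(weights_1[0]))
print(len(weights_2), len(weights_2[0]))
```

Small random values break the symmetry between hidden units, so that they do not all learn the same function.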

Next, we make the prediction using `sigmoid` as the activation function defined in the `forward_propagation()` function:

```
def forward_propagation(X, weights_1, weights_2):
    sigmoid = tf.nn.sigmoid(tf.matmul(X, weights_1))
    y = tf.matmul(sigmoid, weights_2)
    return y
```

First, `sigmoid` output is calculated from input `X` and `weights_1`. This is then used to calculate `y` as a matrix multiplication of `sigmoid` and `weights_2`:

```
y_pred = forward_propagation(X, weights_1, weights_2)
# axis replaces the deprecated dimension argument
predict = tf.argmax(y_pred, axis=1)
```
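To make the two matrix multiplications concrete, here is a plain-Python sketch of the same forward pass on lists of rows. The tiny weight matrices are made-up numbers and `matmul` and `predict` are helper names introduced for this example; it mirrors the structure of the TensorFlow code rather than reproducing it:

```python
import math

def sigmoid(x):
    """Logistic function, written to avoid overflow for large |x|."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)

def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def forward_propagation(X, weights_1, weights_2):
    hidden = [[sigmoid(v) for v in row] for row in matmul(X, weights_1)]
    return matmul(hidden, weights_2)  # unscaled scores (logits), one per class

def predict(y_pred):
    """Index of the largest score in each row, like argmax along axis 1."""
    return [max(range(len(row)), key=row.__getitem__) for row in y_pred]

X = [[1.0, 0.2, -0.4]]                              # bias plus 2 features
weights_1 = [[0.1, -0.2], [0.4, 0.3], [-0.5, 0.2]]  # 3 inputs -> 2 hidden units
weights_2 = [[0.3, -0.1, 0.2], [0.1, 0.4, -0.3]]    # 2 hidden units -> 3 classes
y_pred = forward_propagation(X, weights_1, weights_2)
print(y_pred, predict(y_pred))
```

Note that the output layer applies no activation: the scores are passed as logits to the loss function, which performs the `softmax` itself.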

Next, we define the cost function and optimization using gradient descent. Let’s look at the `GradientDescentOptimizer` being used. It is defined in the `tf.train.GradientDescentOptimizer` class and implements the gradient descent algorithm.

To construct an instance, we use the following constructor and pass `sgd_step` as a parameter:

```
# constructor for GradientDescentOptimizer
__init__(
    learning_rate,
    use_locking=False,
    name='GradientDescent'
)
```

Arguments passed are explained here:

- `learning_rate`: A tensor or a floating point value. The learning rate to use.
- `use_locking`: If True, use locks for update operations.
- `name`: Optional name prefix for the operations created when applying gradients. The default name is `"GradientDescent"`.

The following listing shows the code to implement the `cost` function:

```
cost = tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits(labels=y, logits=y_pred))
updates_sgd = tf.train.GradientDescentOptimizer(sgd_step).minimize(cost)
```
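Conceptually, each run of the optimizer's update moves every weight a small step against the gradient of the cost, scaled by the learning rate. The sketch below shows that rule on a one-parameter toy cost `(w - 3) ** 2`; the cost function and helper name are assumptions made for illustration and are unrelated to the Iris model:

```python
def gradient_descent(grad_fn, w, learning_rate, steps):
    """Repeatedly apply w <- w - learning_rate * gradient(w)."""
    for _ in range(steps):
        w = w - learning_rate * grad_fn(w)
    return w

def toy_gradient(w):
    """Derivative of the toy cost (w - 3) ** 2, whose minimum is at w = 3."""
    return 2.0 * (w - 3.0)

w_small_step = gradient_descent(toy_gradient, w=0.0, learning_rate=0.01, steps=50)
w_large_step = gradient_descent(toy_gradient, w=0.0, learning_rate=0.03, steps=50)
print(w_small_step, w_large_step)  # both move towards 3
```

On this toy cost the larger learning rate ends up closer to the minimum after the same number of steps, although a learning rate that is too large can overshoot and diverge.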

Next, we will implement the following steps:

- Initialize the TensorFlow session with `sess = tf.Session()`.
- Create the variable initializer using `tf.initialize_all_variables()` and run the returned operation in the session. (Later TensorFlow 1.x releases deprecate this call in favor of `tf.global_variables_initializer()`.)
- Iterate over `steps` (1 to 50).
- In each step, execute `updates_sgd` for every example in `train_x` and `train_y`.
- Calculate the `train_accuracy` and `test_accuracy`.

We stored the accuracy for each step in a list so that we could plot a graph:

```
init = tf.initialize_all_variables()
steps = 50
sess.run(init)
x = np.arange(steps)
test_acc = []
train_acc = []
print("Step, train accuracy, test accuracy")
for step in range(steps):
    # Train with each example
    for i in range(len(train_x)):
        sess.run(updates_sgd, feed_dict={X: train_x[i: i + 1],
                                         y: train_y[i: i + 1]})

    train_accuracy = np.mean(np.argmax(train_y, axis=1) ==
                             sess.run(predict, feed_dict={X: train_x, y: train_y}))
    test_accuracy = np.mean(np.argmax(test_y, axis=1) ==
                            sess.run(predict, feed_dict={X: test_x, y: test_y}))
    print("%d, %.2f%%, %.2f%%"
          % (step + 1, 100. * train_accuracy, 100. * test_accuracy))
    test_acc.append(100. * test_accuracy)
    train_acc.append(100. * train_accuracy)
```
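The accuracy expression in the loop compares the index of the true class in each one-hot label row with the predicted class index and averages the matches. A plain-Python sketch of that computation, with helper names and sample values chosen only for illustration:

```python
def argmax(row):
    """Index of the largest entry in a row."""
    return max(range(len(row)), key=row.__getitem__)

def accuracy(one_hot_labels, predicted_classes):
    """Fraction of rows whose predicted class matches the one-hot label."""
    true_classes = [argmax(row) for row in one_hot_labels]
    matches = sum(t == p for t, p in zip(true_classes, predicted_classes))
    return matches / len(true_classes)

labels = [[1, 0, 0], [0, 1, 0], [0, 0, 1], [0, 1, 0]]
predictions = [0, 1, 1, 1]  # third sample is misclassified
print("%.2f%%" % (100. * accuracy(labels, predictions)))
```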

### Code execution

Let’s run this code for `h_size` of `128`, standard deviation of `0.1`, and `sgd_step` of `0.01`:

```
def run(h_size, stddev, sgd_step):
    ...

def main():
    run(128, 0.1, 0.01)

if __name__ == '__main__':
    main()
```

The preceding code outputs the following graph, which plots the steps versus the test and train accuracy:

Let’s compare different SGD step sizes and their effect on training accuracy. The following code is very similar to the previous code example, but we will rerun it for multiple SGD step sizes to see how the step size affects accuracy levels.

```
def run(h_size, stddev, sgd_steps):
    ...
    test_accs = []
    train_accs = []
    time_taken_summary = []
    for sgd_step in sgd_steps:
        start_time = time.time()
        updates_sgd = tf.train.GradientDescentOptimizer(sgd_step).minimize(cost)
        sess = tf.Session()
        init = tf.initialize_all_variables()
        steps = 50
        sess.run(init)
        x = np.arange(steps)
        test_acc = []
        train_acc = []

        print("Step, train accuracy, test accuracy")
        for step in range(steps):
            # Train with each example
            for i in range(len(train_x)):
                sess.run(updates_sgd, feed_dict={X: train_x[i: i + 1],
                                                 y: train_y[i: i + 1]})
            train_accuracy = np.mean(np.argmax(train_y, axis=1) ==
                                     sess.run(predict,
                                              feed_dict={X: train_x, y: train_y}))
            test_accuracy = np.mean(np.argmax(test_y, axis=1) ==
                                    sess.run(predict,
                                             feed_dict={X: test_x, y: test_y}))
            print("%d, %.2f%%, %.2f%%"
                  % (step + 1, 100. * train_accuracy, 100. * test_accuracy))
            # x.append(step)
            test_acc.append(100. * test_accuracy)
            train_acc.append(100. * train_accuracy)
        end_time = time.time()
        diff = end_time - start_time
        time_taken_summary.append((sgd_step, diff))
        t = [np.array(test_acc)]
        t.append(train_acc)
        train_accs.append(train_acc)
```

The output of the preceding code will be an array with training and test accuracy for each SGD step size. In our example, we called the `run()` function with `sgd_steps` set to `[0.01, 0.02, 0.03]`:

```
def main():
    sgd_steps = [0.01, 0.02, 0.03]
    run(128, 0.1, sgd_steps)

if __name__ == '__main__':
    main()
```

This is the plot showing how training accuracy changes with `sgd_steps`. With an SGD step size of `0.03`, the network reaches a higher accuracy faster because each update moves the weights further.

In this post, we built our first neural network, which was feedforward only, and used it for classifying the contents of the Iris dataset.
