
Deep feedforward networks, also called feedforward neural networks, are sometimes also referred to as Multilayer Perceptrons (MLPs). The goal of a feedforward network is to approximate some function f∗. For example, for a classifier, y = f∗(x) maps an input x to a label y. A feedforward network defines a mapping y = f(x;θ) from input to label and learns the values of the parameters θ that result in the best function approximation.

This tutorial is an excerpt from the book, Neural Network Programming with TensorFlow by Manpreet Singh Ghotra and Rajdeep Dua. With this book, learn how to implement more advanced neural networks like CNNs, RNNs, GANs, deep belief networks, and others in TensorFlow.

How do feedforward networks work?

Feedforward networks are a conceptual stepping stone on the path to recurrent networks, which power many natural language applications. Feedforward neural networks are called networks because they are built by composing together many different functions, and these functions are composed in a directed acyclic graph.

The model is associated with a directed acyclic graph describing how the functions are composed together. For example, there are three functions f(1), f(2), and f(3) connected to form f(x) = f(3)(f(2)(f(1)(x))). These chain structures are the most commonly used structures of neural networks. In this case, f(1) is called the first layer of the network, f(2) is called the second layer, and so on. The overall length of the chain gives the depth of the model. It is from this terminology that the name deep learning arises. The final layer of a feedforward network is called the output layer.

Diagram showing various functions activated on input x to form a neural network
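To make the chain structure concrete, here is a minimal sketch in plain Python with made-up layer functions f1, f2, and f3 (the weights are illustrative placeholders, not trained values):

import numpy as np

def f1(x):
    return np.tanh(x @ np.ones((4, 8)))   # first layer: 4 inputs -> 8 hidden units

def f2(h):
    return np.tanh(h @ np.ones((8, 8)))   # second layer: 8 -> 8 hidden units

def f3(h):
    return h @ np.ones((8, 3))            # output layer: 8 -> 3 scores

def f(x):
    # the network is the composition f(x) = f3(f2(f1(x)))
    return f3(f2(f1(x)))

print(f(np.random.rand(1, 4)).shape)      # prints (1, 3)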

These networks are called neural because they are inspired by neuroscience. Each hidden layer is a vector. The dimensionality of these hidden layers determines the width of the model.

Implementing feedforward networks with TensorFlow

Feedforward networks can be easily implemented using TensorFlow by defining placeholders for the inputs and labels, computing the hidden layer activations, and using them to calculate predictions. Let’s take an example of classification with a feedforward network:

X = tf.placeholder("float", shape=[None, x_size])   # input features
y = tf.placeholder("float", shape=[None, y_size])   # true labels
weights_1 = initialize_weights((x_size, hidden_size), stddev)
weights_2 = initialize_weights((hidden_size, y_size), stddev)
sigmoid = tf.nn.sigmoid(tf.matmul(X, weights_1))    # hidden layer activations
y_pred = tf.matmul(sigmoid, weights_2)              # predicted logits
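
The initialize_weights() helper used here is not defined in this excerpt. A minimal sketch of what it could look like, assuming the weights are simply drawn from a normal distribution with the given standard deviation (the book's actual helper may differ):

def initialize_weights(shape, stddev):
    # Assumed implementation: a trainable variable initialized from a
    # normal distribution with the given standard deviation
    weights = tf.random_normal(shape, stddev=stddev)
    return tf.Variable(weights)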

Once the predicted value tensor has been defined, we calculate the cost function:

cost = tf.reduce_mean(tf.nn.OPERATION_NAME(labels=<actual value>, logits=<predicted value>))
updates_sgd = tf.train.GradientDescentOptimizer(sgd_step).minimize(cost)

Here, OPERATION_NAME could be one of the following:

  • tf.nn.sigmoid_cross_entropy_with_logits: Calculates sigmoid cross entropy on incoming logits and labels:
sigmoid_cross_entropy_with_logits(
  _sentinel=None,
  labels=None,
  logits=None,
  name=None
)

_sentinel: Used to prevent positional parameters. Internal, do not use.
labels: A tensor of the same type and shape as logits.
logits: A tensor of type float32 or float64. With x = logits and z = labels, the formula implemented is max(x, 0) - x * z + log(1 + exp(-abs(x))).

  • tf.nn.softmax: Performs softmax activation on the incoming tensor. This only normalizes to make sure all the probabilities in a tensor row add up to one. It cannot be used directly as a classification loss.
softmax = exp(logits) / reduce_sum(exp(logits), dim)

logits: A non-empty tensor. Must be one of the following types: half, float32, or float64.
dim: The dimension softmax will be performed on. The default is -1, which indicates the last dimension.
name: A name for the operation (optional).

  • tf.nn.log_softmax: Calculates the log of the softmax function, which helps avoid numerical underflow. Like softmax, this is only a normalization function, not a loss:

log_softmax(
 logits,
 dim=-1,
 name=None
)

logits: A non-empty tensor. Must be one of the following types: half, float32, or float64.
dim: The dimension softmax will be performed on. The default is -1, which indicates the last dimension.
name: A name for the operation (optional).

  • tf.nn.softmax_cross_entropy_with_logits: Computes softmax cross entropy between logits and labels:
softmax_cross_entropy_with_logits(
  _sentinel=None,
  labels=None,
  logits=None,
  dim=-1,
  name=None
)

_sentinel: Used to prevent positional parameters. For internal use only.
labels: Each row labels[i] must be a valid probability distribution.
logits: Unscaled log probabilities.
dim: The class dimension. Defaulted to -1, which is the last dimension.
name: A name for the operation (optional).

The preceding code snippet computes softmax cross entropy between logits and labels. While the classes are mutually exclusive, their probabilities need not be. All that is required is that each row of labels is a valid probability distribution. For exclusive labels (where one and only one class is true at a time), use sparse_softmax_cross_entropy_with_logits.

  • tf.nn.sparse_softmax_cross_entropy_with_logits
sparse_softmax_cross_entropy_with_logits(
  _sentinel=None,
  labels=None,
  logits=None,
  name=None
)

labels: Tensor of shape [d_0, d_1, ..., d_(r-1)] (where r is the rank of labels and result) and dtype int32 or int64. Each entry in labels must be an index in [0, num_classes). Other values will raise an exception when this operation is run on the CPU and return NaN for the corresponding loss and gradient rows on the GPU.
logits: Unscaled log probabilities of shape [d_0, d_1, ..., d_(r-1), num_classes] and dtype float32 or float64.

The preceding code computes sparse softmax cross entropy between logits and labels. Each label is treated as exclusive: soft classes are not allowed, and the labels vector must provide a single specific index for the true class for each row of logits.

  • tf.nn.weighted_cross_entropy_with_logits
weighted_cross_entropy_with_logits(
  targets,
  logits,
  pos_weight,
  name=None
)

targets: A tensor of the same type and shape as logits.
logits: A tensor of type float32 or float64.
pos_weight: A coefficient to use on the positive examples.

This is similar to sigmoid_cross_entropy_with_logits() except that pos_weight allows a trade-off of recall and precision by up- or down-weighting the cost of a positive error relative to a negative error.
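
To see how some of these ops fit together, here is a small sketch (TensorFlow 1.x, with made-up logits and labels) that checks softmax_cross_entropy_with_logits against the same quantity assembled from log_softmax, and shows pos_weight in the weighted variant:

import tensorflow as tf

# Made-up logits and one-hot labels for a 3-class problem (illustrative values only)
logits = tf.constant([[2.0, 1.0, 0.1],
                      [0.5, 2.5, 0.3]])
labels = tf.constant([[1.0, 0.0, 0.0],
                      [0.0, 1.0, 0.0]])

# Softmax cross entropy computed by the dedicated op...
xent_op = tf.nn.softmax_cross_entropy_with_logits(labels=labels, logits=logits)
# ...and the same quantity assembled manually from log_softmax
xent_manual = -tf.reduce_sum(labels * tf.nn.log_softmax(logits), axis=1)

# Weighted sigmoid cross entropy: pos_weight > 1 penalizes missed positives more heavily
binary_targets = tf.constant([[1.0], [0.0]])
binary_logits = tf.constant([[0.7], [-1.2]])
weighted = tf.nn.weighted_cross_entropy_with_logits(
    targets=binary_targets, logits=binary_logits, pos_weight=2.0)

with tf.Session() as sess:
    print(sess.run(xent_op))       # per-example softmax cross entropy
    print(sess.run(xent_manual))   # matches the values above
    print(sess.run(weighted))      # per-example weighted sigmoid cross entropy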

Analyzing the Iris dataset with a TensorFlow feedforward network

Let’s look at a feedforward example using the Iris dataset.

You can download the dataset from https://github.com/ml-resources/neuralnetwork-programming/blob/ed1/ch02/iris/iris.csv and the target labels from https://github.com/ml-resources/neuralnetwork-programming/blob/ed1/ch02/iris/target.csv.

In the Iris dataset, we will use 150 rows of data made up of 50 samples from each of three Iris species: Iris setosa, Iris virginica, and Iris versicolor.

Petal geometry compared across three Iris species: Iris setosa, Iris virginica, and Iris versicolor.

In the dataset, each row contains data for each flower sample: sepal length, sepal width, petal length, petal width, and flower species. Flower species are stored as integers, with 0 denoting Iris setosa, 1 denoting Iris versicolor, and 2 denoting Iris virginica.

First, we will create a run() function that takes three parameters: the hidden layer size h_size, the standard deviation for weights stddev, and the step size of stochastic gradient descent sgd_step:

def run(h_size, stddev, sgd_step):

Input data loading is done using the genfromtxt function in NumPy. The Iris data loaded has a shape of L: 150 and W: 4 and is stored in the all_X variable. Target labels are loaded from target.csv and one-hot encoded into all_y with a shape of L: 150, W: 3:

import numpy as np
from numpy import genfromtxt
from sklearn.model_selection import train_test_split

def load_iris_data():
    # Load features and integer class labels from CSV files
    data = genfromtxt('iris.csv', delimiter=',')
    target = genfromtxt('target.csv', delimiter=',').astype(int)
    # Prepend the column of 1s for bias
    L, W = data.shape
    all_X = np.ones((L, W + 1))
    all_X[:, 1:] = data
    # One-hot encode the integer labels
    num_labels = len(np.unique(target))
    all_y = np.eye(num_labels)[target]
    # RANDOMSEED is assumed to be defined elsewhere in the full script
    return train_test_split(all_X, all_y, test_size=0.33, random_state=RANDOMSEED)
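
Inside run(), the loader is called to obtain the train/test split; a minimal usage sketch (the shapes shown assume the 150-row Iris data with a 33% test split):

train_x, test_x, train_y, test_y = load_iris_data()
print(train_x.shape, train_y.shape)   # roughly (100, 5) and (100, 3)
print(test_x.shape, test_y.shape)     # roughly (50, 5) and (50, 3)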

Once data is loaded, we initialize the weights matrix based on x_size, y_size, and h_size, with the standard deviation passed to the run() method:

  • x_size = 5
  • y_size = 3
  • h_size = 128 (or any other number chosen for neurons in the hidden layer)
# Size of Layers
x_size = train_x.shape[1] # Input nodes: 4 features and 1 bias
y_size = train_y.shape[1] # Outcomes (3 iris flowers)
# variables
X = tf.placeholder("float", shape=[None, x_size])
y = tf.placeholder("float", shape=[None, y_size])
weights_1 = initialize_weights((x_size, h_size), stddev)
weights_2 = initialize_weights((h_size, y_size), stddev)

Next, we make the prediction using sigmoid as the activation function, defined in the forward_propagation() function:

def forward_propagation(X, weights_1, weights_2):
    sigmoid = tf.nn.sigmoid(tf.matmul(X, weights_1))
    y = tf.matmul(sigmoid, weights_2)
    return y

First, sigmoid output is calculated from input X and weights_1. This is then used to calculate y as a matrix multiplication of sigmoid and weights_2:

y_pred = forward_propagation(X, weights_1, weights_2)
predict = tf.argmax(y_pred, dimension=1)

Next, we define the cost function and optimization using gradient descent. Let’s look at the GradientDescentOptimizer being used. It is defined in the tf.train.GradientDescentOptimizer class and implements the gradient descent algorithm.

To construct an instance, we use the following constructor and pass sgd_step as a parameter:

# constructor for GradientDescentOptimizer
__init__(
  learning_rate,
  use_locking=False,
  name='GradientDescent'
)

Arguments passed are explained here:

  • learning_rate: A tensor or a floating point value. The learning rate to use.
  • use_locking: If True, use locks for update operations.
  • name: Optional name prefix for the operations created when applying gradients. The default name is "GradientDescent".

The following code implements the cost function and the gradient descent update:

cost = tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits(labels=y, logits=y_pred))
updates_sgd = tf.train.GradientDescentOptimizer(sgd_step).minimize(cost)

Next, we will implement the following steps:

  1. Initialize the TensorFlow session:
sess = tf.Session()
  2. Initialize all the variables using tf.initialize_all_variables(); the returned init op is run in the session to initialize them.
  3. Iterate over the steps (1 to 50).
  4. For each example in train_x and train_y, execute updates_sgd.
  5. Calculate the train_accuracy and test_accuracy.

We stored the accuracy for each step in a list so that we could plot a graph:

    init = tf.initialize_all_variables()
    steps = 50
    sess.run(init)
    x = np.arange(steps)
    test_acc = []
    train_acc = []
    print("Step, train accuracy, test accuracy")
    for step in range(steps):
        # Train with each example
        for i in range(len(train_x)):
            sess.run(updates_sgd, feed_dict={X: train_x[i: i + 1], y: train_y[i: i + 1]})
        train_accuracy = np.mean(np.argmax(train_y, axis=1) ==
                                 sess.run(predict, feed_dict={X: train_x, y: train_y}))
        test_accuracy = np.mean(np.argmax(test_y, axis=1) ==
                                sess.run(predict, feed_dict={X: test_x, y: test_y}))

        print("%d, %.2f%%, %.2f%%"
              % (step + 1, 100. * train_accuracy, 100. * test_accuracy))

        test_acc.append(100. * test_accuracy)
        train_acc.append(100. * train_accuracy)

Code execution

Let’s run this code for h_size of 128, standard deviation of 0.1, and sgd_step of 0.01:

def run(h_size, stddev, sgd_step):
    ...

def main():
    run(128, 0.1, 0.01)

if __name__ == '__main__':
    main()

The preceding code outputs the following graph, which plots the steps versus the test and train accuracy:

Plot of the test and train accuracy versus training steps

Let’s compare the change in SGD step size and its effect on training accuracy. The following code is very similar to the previous example, but we rerun it for multiple SGD step sizes to see how the step size affects accuracy levels.

def run(h_size, stddev, sgd_steps):
    ....
    test_accs = []
    train_accs = []
    time_taken_summary = []
    for sgd_step in sgd_steps:
        start_time = time.time()
        updates_sgd = tf.train.GradientDescentOptimizer(sgd_step).minimize(cost)
        sess = tf.Session()
        init = tf.initialize_all_variables()
        steps = 50
        sess.run(init)
        x = np.arange(steps)
        test_acc = []
        train_acc = []
        print("Step, train accuracy, test accuracy")

        for step in range(steps):
            # Train with each example
            for i in range(len(train_x)):
                sess.run(updates_sgd, feed_dict={X: train_x[i: i + 1],
                                                 y: train_y[i: i + 1]})

            train_accuracy = np.mean(np.argmax(train_y, axis=1) ==
                                     sess.run(predict,
                                              feed_dict={X: train_x, y: train_y}))
            test_accuracy = np.mean(np.argmax(test_y, axis=1) ==
                                    sess.run(predict,
                                             feed_dict={X: test_x, y: test_y}))

            print("%d, %.2f%%, %.2f%%"
                  % (step + 1, 100. * train_accuracy, 100. * test_accuracy))
            #x.append(step)
            test_acc.append(100. * test_accuracy)
            train_acc.append(100. * train_accuracy)
        end_time = time.time()
        diff = end_time - start_time
        time_taken_summary.append((sgd_step, diff))
        t = [np.array(test_acc)]
        t.append(train_acc)
        train_accs.append(train_acc)

The output of the preceding code is an array with training and test accuracy for each SGD step size. In our example, we call run() with sgd_steps set to [0.01, 0.02, 0.03]:

def main():
    sgd_steps = [0.01, 0.02, 0.03]
    run(128, 0.1, sgd_steps)

if __name__ == '__main__':
    main()

This plot shows how training accuracy changes with sgd_step. For an SGD step size of 0.03, accuracy rises faster because the step size is larger.

Plot of training accuracy for different SGD step sizes

In this post, we built our first neural network, which was feedforward only, and used it for classifying the contents of the Iris dataset.

You enjoyed a tutorial from the book, Neural Network Programming with TensorFlow. To implement advanced neural networks like CNNs, RNNs, GANs, deep belief networks, and others in TensorFlow, grab your copy today!
