TensorFlow: Next Gen Machine Learning

In November 2015, Google open sourced TensorFlow, its shiny machine intelligence package, promising a simpler way to develop deep learning algorithms that can be deployed anywhere, from your phone to a big cluster, without hassle. It can even take advantage of GPUs for better performance.

Let's Give It a Shot!

First things first, let's install it:

# Ubuntu/Linux 64-bit, CPU only (GPU enabled version requires more deps):
$ sudo pip install --upgrade https://storage.googleapis.com/tensorflow/linux/cpu/tensorflow-0.7.1-cp27-none-linux_x86_64.whl

# Mac OS X, CPU only:
$ sudo easy_install --upgrade six
$ sudo pip install --upgrade https://storage.googleapis.com/tensorflow/mac/tensorflow-0.7.1-cp27-none-any.whl

We are going to play with the well-known Iris dataset. We will train a neural network to take the dimensions of the sepals and petals of an iris plant and classify it as one of three species: Iris setosa, Iris versicolor, and Iris virginica. You can download the training CSV dataset from here.

Reading the Training Data

Because TensorFlow is prepared for cluster-sized data, it allows you to define an input by feeding it with a queue of filenames to process (think of MapReduce output shards). In our simple case, we are going to just hardcode the path to our only file:

import tensorflow as tf


def inputs():

    filename_queue = tf.train.string_input_producer(["iris.data"])

We then need to set up the Reader, which will work with the file contents. In our case, it's a TextLineReader that will produce a tensor for each line of text in the dataset:

    reader = tf.TextLineReader()
    key, value = reader.read(filename_queue)

Then we are going to parse each line into the feature tensor of each sample in the dataset, specifying the data types (in our case, they are all floats except the iris class, which is a string).

    # decode_csv converts a tensor of type string (the text line) into
    # a tuple of tensor columns with the specified defaults, which also
    # set the data type for each column

    sepal_length, sepal_width, petal_length, petal_width, label = \
        tf.decode_csv(value, record_defaults=[[0.0], [0.0], [0.0], [0.0], [""]])

    # we could work with each column separately if we wanted, but here we
    # simply want to process a single feature vector containing all the
    # data for each sample
    features = tf.pack([sepal_length, sepal_width, petal_length, petal_width])
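
To see what this parsing step produces for a single row, here is a minimal plain-Python sketch with no TensorFlow involved. The helper name parse_iris_line is made up for illustration, and the sample row assumes the usual layout of iris.data (four comma-separated measurements followed by the class name):

```python
def parse_iris_line(line):
    """Split one CSV row into a list of 4 floats and a string label."""
    columns = line.strip().split(",")
    features = [float(value) for value in columns[:4]]
    label = columns[4]
    return features, label


# Hypothetical sample row in the assumed iris.data layout.
sample_features, sample_label = parse_iris_line("5.1,3.5,1.4,0.2,Iris-setosa\n")
```

The TensorFlow version does the same thing, except that each column is a tensor that only gets a value once the graph runs in a session.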

Finally, the samples in our data file are sorted by iris type. Training on them in that order would hurt the performance of the model and make it awkward to split the data into training and evaluation sets, so we are going to shuffle the data before returning it, using a tensor queue designed for that purpose. The Iris dataset contains only 150 samples, so a queue capacity of 1500 is more than enough to keep all of it in memory. The batch size sets the number of rows we pack into a single tensor so that operations can be applied to them in parallel:

    return tf.train.shuffle_batch([features, label],
                                  batch_size=100,
                                  capacity=1500,
                                  min_after_dequeue=100)
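
The idea behind a shuffling queue can be sketched in a few lines of plain Python. This is only an approximation for illustration, not how TensorFlow implements it: the function name shuffle_batches is hypothetical, it works on an in-memory list instead of tensors, and unlike the TensorFlow queue it stops when the input runs out instead of cycling over the file again:

```python
import random


def shuffle_batches(samples, batch_size, min_after_dequeue, seed=0):
    """Yield batches drawn at random from a pool of buffered samples.

    While input remains, the pool is refilled so that at least
    min_after_dequeue items are left behind after each batch is drawn.
    """
    rng = random.Random(seed)
    source = iter(samples)
    pool = []
    exhausted = False
    while True:
        # Refill until a full batch can be drawn while still leaving
        # min_after_dequeue items in the pool.
        while not exhausted and len(pool) < min_after_dequeue + batch_size:
            try:
                pool.append(next(source))
            except StopIteration:
                exhausted = True
        if len(pool) < batch_size:
            return  # an incomplete final batch is dropped
        batch = []
        for _ in range(batch_size):
            batch.append(pool.pop(rng.randrange(len(pool))))
        yield batch


# 150 stand-in samples, as in the Iris dataset.
batches = list(shuffle_batches(range(150), batch_size=50, min_after_dequeue=50))
```

A larger min_after_dequeue means each draw has more candidates to pick from, so the mixing is better, at the cost of more memory and a slower start.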

Converting the Data

Our label field on the training dataset is a string that holds one of the three possible values of the Iris class. To make it compatible with the neural network output, we need to convert this data to a three-column vector, one column for each class, where the value is 1 (100% probability) when the sample belongs to that class and 0 otherwise. This is a typical transformation you may need to do with input data.

def string_label_as_probability_tensor(label):

    is_setosa = tf.equal(label, ["Iris-setosa"])
    is_versicolor = tf.equal(label, ["Iris-versicolor"])
    is_virginica = tf.equal(label, ["Iris-virginica"])

    # pack yields shape [3, batch_size]; transpose so each row is one sample
    return tf.to_float(tf.transpose(tf.pack([is_setosa, is_versicolor, is_virginica])))
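
For comparison, the same conversion on ordinary Python strings might look like the sketch below. The names CLASSES and one_hot are made up for this example; the class strings are the ones used in the function above:

```python
CLASSES = ["Iris-setosa", "Iris-versicolor", "Iris-virginica"]


def one_hot(label):
    """Return a list with 1.0 at the position of label and 0.0 elsewhere."""
    return [1.0 if label == name else 0.0 for name in CLASSES]


encoded = [one_hot(label) for label in ["Iris-setosa", "Iris-virginica"]]
# encoded == [[1.0, 0.0, 0.0], [0.0, 0.0, 1.0]]
```

Each row has exactly one 1.0, which is what lets us compare it directly against the probabilities the network outputs.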

The Inference Model (Where the Magic Happens)

We are going to use a single-layer network with a Softmax activation function and one output per iris class. The variables (the learned parameters of our model) are just the weight matrix applied to the features of each input sample, plus a bias vector.

# model: inferred_label = softmax(Wx + b)
# where x is the features vector of each data example

W = tf.Variable(tf.zeros([4, 3]))
b = tf.Variable(tf.zeros([3]))

def inference(features):

    # we need x as a matrix with one row of 4 features per sample
    x = tf.reshape(features, [-1, 4])

    inferred_label = tf.nn.softmax(tf.matmul(x, W) + b)

    return inferred_label

Notice that we left the model parameters as variables outside of the scope of the function. That is because we want to use those same variables both while training and when evaluating and using the model.
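
If you want to check the arithmetic behind softmax(Wx + b) by hand, the following plain-Python sketch computes it for a single sample. The function names softmax and forward are illustrative only, and nested lists stand in for the tensors:

```python
import math


def softmax(scores):
    """Turn raw scores into probabilities that sum to 1."""
    # Subtracting the maximum keeps exp() from overflowing and does
    # not change the result.
    peak = max(scores)
    exps = [math.exp(score - peak) for score in scores]
    total = sum(exps)
    return [value / total for value in exps]


def forward(x, weights, bias):
    """Compute softmax(xW + b) for one sample x."""
    scores = [
        sum(x[i] * weights[i][j] for i in range(len(x))) + bias[j]
        for j in range(len(bias))
    ]
    return softmax(scores)


# Same starting point as the model above: a 4x3 weight matrix and a
# bias vector of length 3, all zeros.
zero_weights = [[0.0] * 3 for _ in range(4)]
zero_bias = [0.0] * 3
probabilities = forward([5.0, 3.6, 1.4, 0.2], zero_weights, zero_bias)
```

With all parameters at zero, every class gets the same probability of one third, so the untrained model has no preference yet. Training is what moves the weights away from that starting point.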

Training the Model

We train the model using backpropagation, trying to minimize cross entropy, which is the usual way to train a Softmax network. At a high level, this means that for each data sample, we compare the output of the inference with the real value and calculate the error (how far off we are). Then we use the error value to adjust the learned parameters in a way that minimizes that error.

We also have to set the learning factor (often called the learning rate), which determines how much of the computed error we apply at each step to correct the parameters. There has to be a balance between the learning factor, the number of training loop cycles, and the number of samples we pack together in the same tensor as a batch; the bigger the batch, the smaller the factor and the higher the number of cycles.

def train(features, tensor_label):

    inferred_label = inference(features)
    cross_entropy = -tf.reduce_sum(tensor_label*tf.log(inferred_label))
    train_step = tf.train.GradientDescentOptimizer(0.001).minimize(cross_entropy)

    return train_step
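
To make the update rule less abstract, here is a hand-written gradient descent step for one sample in plain Python. It is a sketch under simplifying assumptions rather than what the optimizer does internally: the names predict, cross_entropy, and sgd_step are made up, and it relies on the standard result that for Softmax with cross entropy the gradient with respect to each score is the predicted probability minus the target:

```python
import math


def softmax(scores):
    """Turn raw scores into probabilities that sum to 1."""
    peak = max(scores)
    exps = [math.exp(score - peak) for score in scores]
    total = sum(exps)
    return [value / total for value in exps]


def predict(x, weights, bias):
    """Compute softmax(xW + b) for one sample x."""
    scores = [
        sum(x[i] * weights[i][j] for i in range(len(x))) + bias[j]
        for j in range(len(bias))
    ]
    return softmax(scores)


def cross_entropy(probabilities, target):
    """Cross entropy between predicted probabilities and a one-hot target."""
    return -sum(t * math.log(p) for p, t in zip(probabilities, target))


def sgd_step(x, target, weights, bias, learning_rate):
    """Update weights and bias in place; return the loss before the update."""
    probabilities = predict(x, weights, bias)
    loss = cross_entropy(probabilities, target)
    for j in range(len(bias)):
        # Gradient of the loss with respect to score j is p_j - t_j.
        error = probabilities[j] - target[j]
        bias[j] -= learning_rate * error
        for i in range(len(x)):
            weights[i][j] -= learning_rate * error * x[i]
    return loss


weights = [[0.0] * 3 for _ in range(4)]
bias = [0.0] * 3
sample = [5.0, 3.6, 1.4, 0.2]
target = [1.0, 0.0, 0.0]  # setosa

first_loss = sgd_step(sample, target, weights, bias, learning_rate=0.01)
second_loss = sgd_step(sample, target, weights, bias, learning_rate=0.01)
```

The first loss equals ln 3 (about 1.0986) because the untrained model assigns one third to every class, and the second loss is lower because the first update already pushed probability toward the correct class.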

Evaluating the Model

We are going to evaluate our model using accuracy, which is the ratio of cases where our network identifies the right iris class over the total evaluation samples.

def evaluate(evaluation_features, evaluation_labels):

    inferred_label = inference(evaluation_features)
    correct_prediction = tf.equal(tf.argmax(inferred_label, 1), tf.argmax(evaluation_labels, 1))
    accuracy = tf.reduce_mean(tf.cast(correct_prediction, tf.float32))

    return accuracy
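
The same accuracy calculation on plain Python lists may help clarify what the argmax comparison does. The helper names and the four made-up prediction rows below are for illustration only:

```python
def argmax(values):
    """Return the index of the largest value."""
    return max(range(len(values)), key=lambda index: values[index])


def accuracy(predictions, targets):
    """Fraction of rows whose predicted class matches the target class."""
    hits = sum(
        1 for predicted, target in zip(predictions, targets)
        if argmax(predicted) == argmax(target)
    )
    return hits / float(len(targets))


# Made-up network outputs and their one-hot targets.
predictions = [[0.8, 0.1, 0.1], [0.2, 0.5, 0.3], [0.6, 0.3, 0.1], [0.1, 0.2, 0.7]]
targets = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0], [0.0, 0.0, 1.0]]
score = accuracy(predictions, targets)  # 3 of 4 rows match, so 0.75
```

Only the position of the largest probability matters here, so a prediction of 0.5 counts as much as one of 0.99 as long as it is the highest in its row.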

Running the Model

All that is left is to connect our graph and run it in a session, where the defined operations are actually going to use the data. We also split each batch of input data between training and evaluation, around 70%:30%, and run the training loop 1,000 times.

features, label = inputs()
tensor_label = string_label_as_probability_tensor(label)

# slices exclude the end index: rows 0-69 for training, rows 70-99 for evaluation
train_step = train(features[0:70, 0:4], tensor_label[0:70, 0:3])

evaluate_step = evaluate(features[70:100, 0:4], tensor_label[70:100, 0:3])

with tf.Session() as sess:

    sess.run(tf.initialize_all_variables())

    # Start populating the filename queue.
    coord = tf.train.Coordinator()
    threads = tf.train.start_queue_runners(coord=coord)

    for i in range(1000):
        sess.run(train_step)

    print sess.run(evaluate_step)

    # should print 0 => setosa
    print sess.run(tf.argmax(inference([[5.0, 3.6, 1.4, 0.2]]), 1))

    # should be 1 => versicolor
    print sess.run(tf.argmax(inference([[5.5, 2.4, 3.8, 1.1]]), 1))

    # should be 2 => virginica
    print sess.run(tf.argmax(inference([[6.9, 3.1, 5.1, 2.3]]), 1))

    coord.request_stop()
    coord.join(threads)
    sess.close()

If you run this, it should print an accuracy value close to 1. This means our network correctly classifies the samples in almost 100% of the cases, and the model also gives the right answers for the manual samples we pass to it.

Conclusion

Our example was very simple, but TensorFlow actually allows you to do much more complicated things with similar ease, such as working with voice recognition and computer vision. It may not look much different from using any other deep learning or math package, but the key is the ability to run the expressed model in parallel. Google wants to create a mainstream DSL to express data algorithms focused on machine learning, and they may succeed in doing so. For instance, although Google has not yet open sourced the distributed version of the engine, a tool capable of running TensorFlow-modeled graphs directly over an Apache Spark cluster was just presented at the Spark Summit, which shows that the community is interested in expanding its usage.

About the author

Ariel Scarpinelli is a senior Java developer at VirtualMind and a passionate developer with more than 15 years of professional experience. He can be found on Twitter at @triforcexp.