Deep Learning with Torch


Torch is a scientific computing framework built on top of Lua[JIT]. The nn package and the ecosystem around it provide a very powerful framework for building deep learning models, striking a perfect balance between speed and flexibility. It is used at Facebook AI Research (FAIR), Twitter Cortex, DeepMind, Yann LeCun’s group at NYU, Fei-Fei Li’s group at Stanford, and many more industrial and academic labs. If you are like me, and don’t like writing equations for backpropagation every time you want to try a simple model, Torch is a great solution. With Torch, you can also do pretty much anything you can imagine, whether that is writing custom loss functions, dreaming up an arbitrary acyclic graph network, using multiple GPUs, or loading models pre-trained on ImageNet from the Caffe model zoo (yes, you can load models trained in Caffe with a single line). Without further ado, let’s jump right into the awesome world of deep learning.


On Ubuntu 12+ and Mac OS X, installing Torch looks like this:

# in a terminal, run the commands WITHOUT sudo
$ git clone ~/torch --recursive
$ cd ~/torch; bash install-deps;
$ ./
# On Linux with bash
$ source ~/.bashrc
# On OSX or in Linux with no bash.
$ source ~/.profile

Once you’ve installed Torch, you can run a Torch script using:

$ th script.lua
# alternatively you can fire up a terminal torch interpreter using th -i
$ th -i
# and run multiple scripts one by one, the variables will be accessible to other scripts
> dofile 'script1.lua'
> dofile 'script2.lua'
> print(variable) -- variable from either of these scripts.

The sections below are code-intensive; you can follow along by running the commands in Torch’s terminal interpreter:

$ th -i

Building a Model: The Basics

A module is the basic building block of any Torch model. It has forward and backward methods for forward and backward passes of backpropagation. You can combine them using containers, and of course, calling forward and backward on containers propagates inputs and gradients correctly.

-- A simple mlp model with sigmoids

require 'nn'
linear1 = nn.Linear(100,10) -- A linear layer Module
linear2 = nn.Linear(10,2)

-- You can combine modules using containers, sequential is the most used one
model = nn.Sequential() -- A container
model:add(linear1)
model:add(nn.Sigmoid())
model:add(linear2)
model:add(nn.Sigmoid())

-- the forward step
input = torch.rand(100)
target = torch.rand(2)
output = model:forward(input)

Now we need a criterion to measure how well our model is performing, in other words, a loss function. nn.Criterion is the abstract class that all loss functions inherit from. It provides forward and backward methods, which compute the loss and its gradients respectively. Torch provides most of the commonly used criteria out of the box, and it isn’t much of an effort to write your own either.

criterion = nn.MSECriterion() -- mean squared error criterion
loss = criterion:forward(output,target)
gradientsAtOutput = criterion:backward(output,target)
-- To perform the backprop step, we need to pass these gradients to the backward
-- method of the model
model:zeroGradParameters() -- gradients accumulate, so clear them first
gradAtInput = model:backward(input,gradientsAtOutput)
lr = 0.1  -- learning rate for our model
model:updateParameters(lr)  -- updates the parameters using the lr parameter.

The updateParameters method just subtracts the gradients, scaled by the learning rate, from the model parameters. This is vanilla stochastic gradient descent. Typically, the updates we do are more complex. For example, if we want to use momentum, we need to keep track of the updates we made in the previous step. There are many fancier optimization schemes, such as RMSProp, Adam, Adagrad, and L-BFGS, that do more complex things like adapting the learning rate, the momentum factor, and so on. The optim package provides these optimization routines out of the box.
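To make the difference between the two update rules concrete, here is a minimal plain-Lua sketch. It works on ordinary Lua tables rather than Tensors, and the names sgdStep and momentumStep are made up for illustration; this is not how nn or optim implement their updates internally.

```lua
-- Plain-Lua sketch of two update rules (illustrative; not Torch's implementation).
-- params and grads are arrays of numbers of the same length.

-- Vanilla SGD: p <- p - lr * g
function sgdStep(params, grads, lr)
  for i = 1, #params do
    params[i] = params[i] - lr * grads[i]
  end
end

-- SGD with classical momentum: v <- mu * v + g ; p <- p - lr * v
-- The velocity table is the state carried over from the previous step.
function momentumStep(params, grads, velocity, lr, mu)
  for i = 1, #params do
    velocity[i] = mu * (velocity[i] or 0) + grads[i]
    params[i] = params[i] - lr * velocity[i]
  end
end

-- Minimise f(p) = p^2 (gradient 2p) for two steps, starting from p = 1.
plain, withMomentum, velocity = {1.0}, {1.0}, {}
for step = 1, 2 do
  sgdStep(plain, {2 * plain[1]}, 0.1)
  momentumStep(withMomentum, {2 * withMomentum[1]}, velocity, 0.1, 0.9)
end
print(plain[1], withMomentum[1])
```

After two steps the momentum version has moved further towards the minimum, because the second update still carries most of the first gradient along with it.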


We’ll use the German Traffic Sign Recognition Benchmark (GTSRB) dataset. This dataset has 43 classes of traffic signs with varying sizes, illumination, and occlusion. There are about 39,000 training images and 12,000 test images. The traffic signs are not necessarily centered in the images, and each has a 10% border around it.

I have included a shell script for downloading the data along with the code for this tutorial in this GitHub repo.[1]

git clone
cd tutorial.gtsrb.torch/datasets


Let’s build a downsized VGG-style model with what we’ve learned.

function createModel()
  require 'nn'
  nbClasses = 43
  local net = nn.Sequential()

  --[[building block: adds a convolution layer, batch norm layer and a relu activation to the net]]--
  function ConvBNReLU(nInputPlane, nOutputPlane)
    -- kernel size = (3,3), stride = (1,1), padding = (1,1)
    net:add(nn.SpatialConvolution(nInputPlane, nOutputPlane, 3,3, 1,1, 1,1))
    net:add(nn.SpatialBatchNormalization(nOutputPlane))
    net:add(nn.ReLU(true))
    return net
  end

  -- ... (remaining layers of the network) ...

  return net
end

The code in the repo is much more polished than the snippets in the tutorial. It is modular and allows you to change the model and/or datasets easily.

The first layer contains three input channels because we’re going to pass RGB images (three channels). For grayscale images, the first layer has one input channel. I encourage you to play around and modify the network.[2]

There are a bunch of new modules that need some elaboration. The Dropout module randomly deactivates each neuron with some probability during training. It is known to help generalization by preventing co-adaptation between neurons; that is, a neuron should now depend less on its peers, forcing it to learn a bit more. BatchNormalization is a very recent development. It is known to speed up convergence by normalizing the outputs of a layer to zero mean and unit variance using the statistics of a batch.
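If it helps to see the arithmetic, here is a rough plain-Lua sketch of both ideas on a flat list of activations. The function names are hypothetical. The dropout variant shown rescales the surviving activations by 1/(1-p), which is an assumption on my part about the variant in use, and the batch norm sketch leaves out the learned scale and shift and the running statistics that the real modules keep.

```lua
-- Plain-Lua sketches (illustrative only; not how the nn modules are implemented).

-- Dropout at training time: zero each activation with probability p and
-- rescale the survivors by 1/(1-p) so the expected value stays the same.
-- rand is an optional random source, injected to keep demos repeatable.
function dropout(activations, p, rand)
  rand = rand or math.random
  local out = {}
  for i = 1, #activations do
    if rand() < p then
      out[i] = 0
    else
      out[i] = activations[i] / (1 - p)
    end
  end
  return out
end

-- Batch normalization core: shift to zero mean, scale to unit variance,
-- using the mean and variance of the batch itself.
function batchNormalize(batch, eps)
  eps = eps or 1e-5
  local n = #batch
  local mean = 0
  for i = 1, n do mean = mean + batch[i] end
  mean = mean / n
  local var = 0
  for i = 1, n do var = var + (batch[i] - mean) ^ 2 end
  var = var / n
  local out = {}
  for i = 1, n do
    out[i] = (batch[i] - mean) / math.sqrt(var + eps)
  end
  return out, mean, var
end

normalized, batchMean, batchVar = batchNormalize({2, 4, 6, 8})
print(batchMean, batchVar)
```

For the batch {2, 4, 6, 8} the mean and the variance both come out as 5, so the normalized values are symmetric around zero.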

Let’s use this model and train it. In the interest of brevity, I’ll use the following constructs directly; the code defining them is in datasets/gtsrb.lua.

  • DataGen:trainGenerator(batchSize)
  • DataGen:valGenerator(batchSize)

These provide iterators over batches of train and test data respectively.

You’ll find that the model code (models/vgg_small.lua) in the repo is different. It is designed to allow you to experiment quickly.

Using optim to train the model

Using stochastic gradient descent (SGD) from the optim package to minimize a function f looks like this:

optim.sgd(feval, params, optimState)


  • feval: A user-defined function that respects the API: f, df/dparams = feval(params)
  • params: The current parameter vector (a 1D torch.Tensor)
  • optimState: A table of parameters, and state variables, dependent upon the algorithm
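The contract is easier to see on a toy problem. The sketch below is plain Lua on a single number, with made-up names (quadFeval, simpleSgd); optim.sgd itself works on Tensors and supports many more options, so treat this only as an illustration of the "return the loss and its gradient" idea.

```lua
-- Plain-Lua sketch of the feval contract on a scalar problem
-- (hypothetical names; not the optim package's code).

-- Minimise f(x) = (x - 3)^2, whose gradient is 2 * (x - 3).
function quadFeval(x)
  local f = (x - 3) ^ 2
  local dfdx = 2 * (x - 3)
  return f, dfdx
end

-- A bare-bones optimizer step that relies only on the contract above:
-- ask feval for the loss and gradient, then move against the gradient.
function simpleSgd(feval, x, state)
  local f, dfdx = feval(x)
  return x - state.learningRate * dfdx, f
end

x = 0
state = {learningRate = 0.1}
for i = 1, 100 do
  x, lastLoss = simpleSgd(quadFeval, x, state)
end
print(x, lastLoss)
```

Because the optimizer only ever talks to feval, you can swap in any function you like, a neural network loss included, without touching the optimizer.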

Since we are optimizing the loss of the neural network, parameters should be the weights and other parameters of the network. We get these as a flattened 1D tensor using model:getParameters. It also returns a tensor containing the gradients of these parameters. This is useful in creating the feval function above.

model = createModel()
criterion = nn.ClassNLLCriterion() -- criterion we are optimizing: negative log loss

params, gradParams = model:getParameters()

local function feval()
  -- criterion.output stores the latest output of criterion
  return criterion.output, gradParams
end

We need to create an optimState table and initialize it with a configuration of our optimizer like learning rate and momentum:

optimState = {
      learningRate = 0.01,
      momentum = 0.9,
      dampening = 0.0,
      nesterov = true,
}

Now, an update to the model should do the following:

  1. Compute the output of the model using model:forward().
  2. Compute the loss and the gradients at output layer using criterion:forward() and criterion:backward() respectively.
  3. Update the gradients of the model parameters using model:backward().
  4. Update the model using optim.sgd.

-- Forward pass
output = model:forward(input)
loss = criterion:forward(output, target)

-- Backward pass
model:zeroGradParameters() -- clear grads from previous update
critGrad = criterion:backward(output, target)
model:backward(input, critGrad)

-- Updates
optim.sgd(feval, params, optimState)

Note: The order above should be respected, as backward assumes forward was run just before it. Changing this order might result in gradients not being computed correctly.

Putting it all together

Let’s put it all together and write a function that trains the model for an epoch. We’ll create a loop that iterates over the train data in batches and updates the model.

require 'optim'

model = createModel()
criterion = nn.ClassNLLCriterion()
dataGen = DataGen('datasets/GTSRB/') -- Data generator

params, gradParams = model:getParameters()

batchSize = 32
optimState = {
      learningRate = 0.01,
      momentum = 0.9,
      dampening = 0.0,
      nesterov = true,
}

function train()
  -- Dropout and BN behave differently during training and testing
  -- So, switch to training mode
  model:training()

  local function feval()
      return criterion.output, gradParams
  end

  for input, target in dataGen:trainGenerator(batchSize) do
      -- Forward pass
      local output = model:forward(input)
      local loss = criterion:forward(output, target)
      -- Backward pass
      model:zeroGradParameters() -- clear grads from previous update
      local critGrad = criterion:backward(output, target)
      model:backward(input, critGrad)

      -- Updates
      optim.sgd(feval, params, optimState)
  end
end

The test function is extremely similar, except that we don’t need to update the parameters:

confusion = optim.ConfusionMatrix(nbClasses) -- to calculate accuracies

function test()
  model:evaluate()  -- switch to evaluate mode
  confusion:zero()  -- clear confusion matrix

  for input, target in dataGen:valGenerator(batchSize) do
    local output = model:forward(input)
    confusion:batchAdd(output, target)
  end

  confusion:updateValids() -- recompute accuracies from the accumulated counts
  local test_acc = confusion.totalValid * 100
  print(('Test accuracy: %.2f'):format(test_acc))
end
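In case the confusion matrix feels like a black box, here is a small plain-Lua sketch of how an overall accuracy can be read off one: the correct predictions sit on the diagonal, so accuracy is the diagonal sum divided by the total count. The 3-class counts are invented for illustration, and overallAccuracy is a hypothetical helper, not part of optim.

```lua
-- Plain-Lua sketch: overall accuracy from a confusion matrix
-- (illustrative counts; not the optim.ConfusionMatrix implementation).
-- rows = true class, columns = predicted class
counts = {
  {50, 2, 3},
  {4, 40, 1},
  {0, 5, 45},
}

function overallAccuracy(matrix)
  local correct, total = 0, 0
  for i = 1, #matrix do
    for j = 1, #matrix[i] do
      total = total + matrix[i][j]
      -- diagonal entries are the samples predicted as their true class
      if i == j then correct = correct + matrix[i][j] end
    end
  end
  return correct / total
end

accuracy = overallAccuracy(counts)
print(('Accuracy: %.2f%%'):format(accuracy * 100))
```

Here 135 of the 150 samples lie on the diagonal, which gives an accuracy of 90%.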

Now that everything is set, you can train your network and print the test accuracies:

max_epoch = 20
for i = 1, max_epoch do
  train()
  test()
end

Training takes around 30 seconds per epoch on a TitanX and reaches about 97.7% accuracy after 20 epochs. This is a very basic model and honestly I haven’t tried optimizing the parameters much. There are a lot of things that can be done to crank up the accuracies.

  • Try different processing procedures.
  • Experiment with the net structure.
  • Try different weight initializations and learning rate schedules.
  • Use an ensemble of different models; for example, train multiple models and take a majority vote.
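As a rough idea of what the majority vote in the last item could look like, here is a plain-Lua sketch. It assumes each model has already produced a predicted class label for the same image; majorityVote is a hypothetical helper, and ties go to whichever label reaches the top count first.

```lua
-- Plain-Lua sketch of majority voting over per-model predictions
-- (hypothetical helper; labels are class indices such as 1..43).
function majorityVote(predictions)
  local votes, best, bestCount = {}, nil, 0
  for _, label in ipairs(predictions) do
    votes[label] = (votes[label] or 0) + 1
    -- strict comparison: on a tie, the earlier leader keeps the win
    if votes[label] > bestCount then
      best, bestCount = label, votes[label]
    end
  end
  return best
end

-- Three models vote on one image; two of them agree on class 14.
winner = majorityVote({14, 17, 14})
print(winner)
```

Averaging the class probabilities of the models is another common way to combine them; the vote above is just the simplest version to write down.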

You can have a look at the state of the art on this dataset here. They achieve upwards of 99.5% accuracy using a clever method to boost the geometric variation of CNNs.


We looked at how to build a basic MLP in Torch. We then moved on to building a convolutional neural network and trained it to solve a real-world problem: traffic sign recognition.

For a beginner, Torch/Lua might not be the easiest place to start. But once you get the hang of it, you have access to a deep learning framework that is very flexible yet fast. You will be able to easily reproduce the latest research or try new ideas, unlike in more rigid frameworks like Keras or nolearn. I encourage you to give it a fair try if you are going anywhere near deep learning.


About the author

Preetham Sreenivas is a data scientist at Fractal Analytics. Prior to that, he was a software engineer at Directi.

