
“Deep learning (deep structured learning, hierarchical learning, or deep machine learning) is a branch of machine learning based on a set of algorithms that attempt to model high-level abstractions in data by using multiple processing layers, with complex structures or otherwise, composed of multiple nonlinear transformations.” –Wikipedia

High Performance Computing (HPC) is not a new concept, but only recently have technical advances and economies of scale made HPC accessible to the masses through affordable yet powerful configurations. Anyone interested can buy commodity hardware and start working on Deep Learning, bringing this machine learning subset of artificial intelligence out of research labs and into garages.

DeepLearning.net is a starting point for more information about Deep Learning. Nvidia ParallelForAll is a nice resource for learning GPU-based Deep Learning (Core Concepts, History and Training, and Sequence Learning).

What is CNTK

Microsoft Research released its Computational Network Toolkit in January this year. CNTK is a unified deep-learning toolkit that describes neural networks as a series of computational steps via a directed graph.
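To make the "series of computational steps via a directed graph" idea concrete, here is a minimal toy sketch in plain Python. It is not CNTK's API: the Node class and the evaluate function are hypothetical names invented for this illustration, and the graph describes a single sigmoid neuron.

```python
# Toy computational network: each node is one computational step, and
# edges point from a node to the inputs it depends on.
# Node and evaluate are illustrative names, not part of CNTK.
import math


class Node:
    def __init__(self, name, op=None, inputs=()):
        self.name = name
        self.op = op                # callable combining input values; None for leaf inputs
        self.inputs = list(inputs)  # upstream nodes this step depends on


def evaluate(node, feed, cache=None):
    """Evaluate a node by recursively evaluating its inputs (depth-first)."""
    if cache is None:
        cache = {}
    if node.name in cache:
        return cache[node.name]
    if node.op is None:
        value = feed[node.name]     # leaf: value supplied by the caller
    else:
        value = node.op(*(evaluate(i, feed, cache) for i in node.inputs))
    cache[node.name] = value
    return value


# Graph for a single sigmoid neuron: y = sigmoid(w * x + b)
x = Node("x")
w = Node("w")
b = Node("b")
times = Node("times", lambda a, c: a * c, [w, x])
plus = Node("plus", lambda a, c: a + c, [times, b])
y = Node("y", lambda z: 1.0 / (1.0 + math.exp(-z)), [plus])

result = evaluate(y, {"x": 2.0, "w": 0.5, "b": -1.0})
print(result)  # 0.5, since sigmoid(0.5 * 2.0 - 1.0) = sigmoid(0.0)
```

A real toolkit adds automatic differentiation and GPU execution on top of this structure, but the graph-of-steps description is the same basic idea.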

CNTK supports the following model types, along with stochastic gradient descent for training them:

–          Feed Forward Deep Neural Networks (DNN)

–          Convolutional Neural Networks (CNN)

–          Recurrent Neural Networks (RNN)/Long Short-Term Memory units (LSTM)

–          Stochastic Gradient Descent (SGD)
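
For readers new to stochastic gradient descent, the following is a minimal sketch of the idea in plain Python, fitting a single weight to toy data. It does not use CNTK; the data, learning rate, and variable names are assumptions chosen only for illustration.

```python
# Minimal SGD sketch: fit y = w * x on toy data, one sample at a time.
# The data, learning rate, and epoch count are illustrative assumptions.
import random

data = [(1.0, 3.0), (2.0, 6.0)]   # samples of y = 3 * x
w = 0.0                            # initial weight
learning_rate = 0.1
rng = random.Random(0)             # fixed seed for a repeatable run

for epoch in range(50):
    rng.shuffle(data)              # "stochastic": visit samples in random order
    for x, y in data:
        error = w * x - y
        gradient = 2.0 * error * x # derivative of (w * x - y) ** 2 with respect to w
        w -= learning_rate * gradient

print(w)  # converges towards 3.0
```

Each update uses the gradient from a single sample rather than the whole data set, which is what keeps the per-step cost low when training large networks.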

Why CNTK? Better Scaling

When Microsoft CNTK was released, its standout feature was distributed computing: a developer was no longer limited by the number of GPUs installed in a single machine. This was a significant breakthrough, because even the best machines were limited to 4-way SLI, capping the total number of GPU cores at 4 x 3072 = 12288.

Building such a developer machine also put extra strain on the hardware budget, because the configuration left very little room for upgrades. There is only one motherboard available that supports 4-way PCI-E Gen3 x16, and very few manufacturers provide a good-quality 1600W power supply that can support four Titans.

This meant that developers were forced to pay a hefty premium for upgradability in terms of the motherboard and processor, and to settle for an older-generation processor. Distributed computing is essential in High Performance Computing, since it allows scaling out as opposed to scaling up. Developers can build grids from cheaper nodes with the latest processors, lowering the hardware cost of entry.

In December 2015, Microsoft Research demonstrated that distributed GPU computing is most efficient in CNTK. In comparison, Google TensorFlow, FAIR Torch, and Caffe did not allow scaling beyond a single machine, and Theano fared worst, as it did not even scale across multiple GPUs on the same machine.

On April 13, Google Research released support for distributed computing. The claimed speedup is 56X for 100 GPUs and 40X for 50 GPUs. That falls well short of linear scaling, a sharp drop-off for any sizable distributed machine learning setup. I do not have comparative performance figures for CNTK, but its scaling across GPUs on a single machine produced very good numbers.
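
The size of that drop-off can be worked out from the two claimed figures. The short sketch below divides speedup by GPU count to get scaling efficiency; the function and variable names are just illustrative.

```python
# Scaling efficiency from the claimed speedups (40X on 50 GPUs, 56X on 100 GPUs).
def scaling_efficiency(speedup, num_gpus):
    """Fraction of ideal linear speedup achieved (1.0 means perfect scaling)."""
    return speedup / num_gpus


claimed = {50: 40.0, 100: 56.0}   # number of GPUs -> claimed speedup

for gpus, speedup in sorted(claimed.items()):
    eff = scaling_efficiency(speedup, gpus)
    print(f"{gpus} GPUs: {speedup:.0f}X speedup, {eff:.0%} efficiency")

# Extra speedup contributed by each of the second 50 GPUs
marginal = (claimed[100] - claimed[50]) / (100 - 50)
print(f"marginal gain per added GPU: {marginal:.2f}X")
```

By these figures, efficiency falls from 80% at 50 GPUs to 56% at 100 GPUs, and each of the second 50 GPUs contributes only about a third of a GPU's worth of extra speedup.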

GPU Performance

One of the shocking findings with my custom-built commodity hardware (2 x Titan X) was the difference in TFLOPS achieved under Ubuntu 14.04 LTS and Windows 10. With a fully updated OS and the latest drivers from NVIDIA, I got double the TFLOPS under Windows compared with Ubuntu. I would like to rerun the samples with Ubuntu 16.04 LTS, but until then, Windows is the clear winner in performance.

CNTK works perfectly on Windows, but TensorFlow has a dependency on Bazel, which as of now does not build on Windows (Bug #947). Google could look into this and make TensorFlow work on Windows, or Ubuntu and Nvidia could close the TFLOPS gap with Windows. Until then, architects have two options: either settle for lower TFLOPS under Ubuntu with TensorFlow, or migrate to CNTK for increased performance.

Getting started with CNTK

Let’s see how to get started with CNTK.

Binary Installation

Currently, the CNTK binary installation is the easiest way to get started with CNTK. Just follow the instructions. The only downside is that the currently available binaries are compiled with CUDA 7.0, rather than the latest CUDA 7.5 (released almost a year ago).

Codebase Compilation

If you want to learn CNTK in detail, and if you are feeling adventurous, you should try compiling CNTK from source. Compile the code base, even when you do not expect to use the generated binary, because the whole compilation process will be a good peek under the hood and enhance your understanding of Deep Learning.

The instructions for the Windows installation are available here, whereas the Linux installation instructions are available here. If you want to enable 1-bit Stochastic Gradient Descent (1bit-SGD), you should follow these instructions. 1bit-SGD is licensed more restrictively, and you have to understand the differences if you are looking for commercial deployments.
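
For context on what 1bit-SGD does: the general idea, as it is usually described, is to cut communication between workers by sending each gradient value as a single bit and carrying the quantization error over into the next minibatch. The sketch below illustrates that idea in plain Python. It is not CNTK's implementation, and quantize_1bit and dequantize_1bit are hypothetical names invented for this example.

```python
# Sketch of 1-bit gradient quantization with error feedback.
# Illustrative only: not CNTK's implementation; function names are hypothetical.
def quantize_1bit(gradient, residual):
    """Quantize a gradient to one bit per value, carrying the error forward.

    Returns (bits, pos_value, neg_value, new_residual).
    """
    # Add the quantization error left over from the previous minibatch.
    corrected = [g + r for g, r in zip(gradient, residual)]
    bits = [1 if v >= 0.0 else 0 for v in corrected]

    # One reconstruction value per bucket (the bucket mean), sent with the bits.
    pos = [v for v in corrected if v >= 0.0]
    neg = [v for v in corrected if v < 0.0]
    pos_value = sum(pos) / len(pos) if pos else 0.0
    neg_value = sum(neg) / len(neg) if neg else 0.0

    # Whatever the receiver cannot reconstruct is remembered for next time.
    decoded = dequantize_1bit(bits, pos_value, neg_value)
    new_residual = [v - d for v, d in zip(corrected, decoded)]
    return bits, pos_value, neg_value, new_residual


def dequantize_1bit(bits, pos_value, neg_value):
    """Rebuild an approximate gradient from the bits and two bucket values."""
    return [pos_value if b else neg_value for b in bits]


gradient = [0.5, -0.25, 1.5, -0.75]
residual = [0.0] * len(gradient)
bits, pos_value, neg_value, residual = quantize_1bit(gradient, residual)
decoded = dequantize_1bit(bits, pos_value, neg_value)
print(bits)      # [1, 0, 1, 0]
print(decoded)   # [1.0, -0.5, 1.0, -0.5]
print(residual)  # [-0.5, 0.25, 0.5, -0.25]
```

Only the bits and the two bucket values need to travel over the network, and the residual ensures that the information dropped in one step is added back in later steps.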

Windows compilation is characterized by older versions of libraries. Nvidia CUDA and cuDNN were recently updated to 7.5, whereas other dependencies such as Nvidia CUB, Boost, and OpenCV are still using older versions. Pay extra attention to the versions listed in the documentation to ensure smooth compilation. Nvidia has updated Nsight to support Visual Studio 2015; however, Microsoft CNTK still supports only Visual Studio 2013.


To test the CNTK installation, here are some really great samples:

Alternative Deep Learning Toolkits


Theano

Theano is possibly the oldest Deep Learning framework available. The latest release, 0.8, which was released on March 16, enables the much-awaited multi-GPU support (there are no indications of distributed computing support, though). cuDNN v5 and CNMeM are also supported. A detailed report is available here. Python bindings are available.


Caffe

Caffe is a deep learning framework primarily oriented towards image processing. Python bindings are available.

Google TensorFlow

TensorFlow is a deep learning framework written in C++ with Python API bindings. The computation graph is built in Python, and benchmarks have shown TensorFlow to be slower than other frameworks.

Google has been pushing Go for a long time now, and it has even open-sourced the language. But when it came to TensorFlow, Python was chosen over Go. There are also concerns about whether Google will support commercial implementations.

FAIR Torch

Facebook AI Research (FAIR) has released its extension to Torch7. Torch is a scientific computing framework with Lua as its primary language. Lua has certain advantages over Python (lower interpreter overhead, simpler integration with C code), which lend themselves to Torch. Moreover, multi-core support through OpenMP directives points to better performance.


Leaf

Leaf is the latest addition to the machine learning frameworks. It is based on the Rust programming language (positioned as a replacement for C/C++). Leaf is a framework created by hackers for hackers rather than for scientists. Leaf has some nice performance improvements.


Conclusion

Deep Learning with GPUs is an emerging field, and much remains to be done to make good products out of machine learning. Every product team therefore needs to evaluate all of the available alternatives (programming language, operating system, drivers, libraries, and frameworks) against its specific use cases. Currently, there is no one-size-fits-all approach.

About the author

Sarvex Jatasra is a Technology Aficionado, exploring ways to apply technology to make lives easier. He is currently working as the Chief Technology Officer at 8Minutes. When not in touch with technology, he is involved in physical activities such as swimming and cycling.

