
Last week, the MLPerf effort released the results for MLPerf Training v0.6, the second round of results from its machine learning training performance benchmark suite. These benchmarks give AI practitioners common standards for measuring the performance and speed of the hardware used to train AI models. In this round of results, Nvidia and Google Cloud set new AI training time performance records.

MLPerf v0.6 measures the training performance of machine learning acceleration hardware across six categories: image classification, object detection (lightweight), object detection (heavyweight), translation (recurrent), translation (non-recurrent), and reinforcement learning.

MLPerf is an association of more than 40 companies and researchers from leading universities, and the MLPerf benchmark suites are becoming the industry standard for measuring machine learning performance.

As per the results, Nvidia’s Tesla V100 Tensor Core GPUs, running in an Nvidia DGX SuperPOD, completed on-premise training of the ResNet-50 image classification model in 80 seconds. Nvidia was also the only vendor to submit results in all six categories. For comparison, when Nvidia launched the DGX-1 server in 2017, the same training took eight hours to complete.

In a statement to ZDNet, Paresh Kharya, director of Accelerated Computing for Nvidia said, “The progress made in just a few short years is staggering.” He further added, “The results are a testament to how fast this industry is moving.”

Google Cloud entered five categories and set three records for performance at scale with its Cloud TPU v3 Pods, Google’s latest generation of supercomputers built specifically for machine learning.

Each of the record-setting Cloud TPU Pod runs used less than two minutes of compute time. The TPU v3 Pods set a record in machine translation from English to German, training the Transformer model in 51 seconds. Overall, Cloud TPU v3 Pods train models over 84% faster than the fastest on-premise systems in the MLPerf Closed Division.

The TPU v3 Pods also achieved record performance in the image classification benchmark, training the ResNet-50 model on the ImageNet data set, as well as in another object detection category, completing model training in 1 minute and 12 seconds.

In a statement to ZDNet, Google Cloud’s Zak Stone said, “There’s a revolution in machine learning.” He further added, “All these workloads are performance-critical. They require so much compute, it really matters how fast your system is to train a model. There’s a huge difference between waiting for a month versus a couple of days.”
