
Today, Google Cloud announced the alpha availability of ‘Cloud TPU Pods’: tightly coupled supercomputers built with hundreds of Google’s custom Tensor Processing Unit (TPU) chips and dozens of host machines, linked via an ultrafast custom interconnect. Google states that these pods make it easier, faster, and more cost-effective to develop and deploy cutting-edge machine learning workloads on Google Cloud. Developers can iterate over their training data in minutes and train huge production models in hours or days instead of weeks.

The Tensor Processing Unit (TPU) is an ASIC that powers several of Google’s major products, including Translate, Photos, Search, Assistant, and Gmail. A single Cloud TPU v2 Pod provides up to 11.5 petaflops of performance.

Features of Cloud TPU Pods

#1 Proven Reference Models

Customers can take advantage of  Google-qualified reference models that are optimized for performance, accuracy, and quality for many real-world use cases. These include object detection, language modeling, sentiment analysis, translation, image classification, and more.

#2 Connect Cloud TPUs to Custom Machine Types

Users can connect to Cloud TPUs from custom VM types. This lets them optimally balance processor speeds, memory, and high-performance storage resources for their individual workloads.
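To give a sense of what this looks like in practice, here is a minimal TensorFlow 1.12-era sketch of training on a Cloud TPU from a VM via the TPUEstimator API. The TPU name, bucket path, and toy model are hypothetical placeholders, not Google’s reference code:

```python
import tensorflow as tf  # TensorFlow 1.12-era APIs, matching the benchmark setup below

def model_fn(features, labels, mode, params):
    """Toy model; a real job would plug in e.g. a reference ResNet model_fn."""
    logits = tf.layers.dense(features, 10)
    loss = tf.losses.sparse_softmax_cross_entropy(labels=labels, logits=logits)
    # CrossShardOptimizer averages gradients across TPU cores every step.
    optimizer = tf.contrib.tpu.CrossShardOptimizer(
        tf.train.GradientDescentOptimizer(0.01))
    train_op = optimizer.minimize(loss, global_step=tf.train.get_global_step())
    return tf.contrib.tpu.TPUEstimatorSpec(mode=mode, loss=loss, train_op=train_op)

def input_fn(params):
    # TPUEstimator injects the per-core batch size via params.
    features = tf.random_uniform([1024, 64])
    labels = tf.random_uniform([1024], maxval=10, dtype=tf.int32)
    dataset = tf.data.Dataset.from_tensor_slices((features, labels))
    return dataset.repeat().batch(params['batch_size'], drop_remainder=True)

# 'demo-tpu' and the bucket are hypothetical; the VM running this script can
# be any custom machine type, since it reaches the TPU over the network.
resolver = tf.contrib.cluster_resolver.TPUClusterResolver(tpu='demo-tpu')
config = tf.contrib.tpu.RunConfig(
    cluster=resolver,
    model_dir='gs://demo-bucket/model',
    tpu_config=tf.contrib.tpu.TPUConfig(iterations_per_loop=100))

estimator = tf.contrib.tpu.TPUEstimator(
    model_fn=model_fn, config=config, use_tpu=True, train_batch_size=1024)
estimator.train(input_fn=input_fn, max_steps=1000)
```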

#3 Preemptible Cloud TPU

Preemptible Cloud TPUs are 70% cheaper than on-demand instances. Long training runs with checkpointing, as well as batch prediction on large datasets, can now be run on Cloud TPUs at a much lower price.
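A preemptible TPU is requested at creation time (for example via the preemptible option of the gcloud TPU creation command); the training job itself mainly needs aggressive checkpointing so a preempted run can resume. A minimal TF 1.12-era sketch, with a hypothetical TPU name and bucket:

```python
import tensorflow as tf

# A preemptible Cloud TPU can be reclaimed at any time, so checkpoint often;
# restarting training then resumes from the newest checkpoint in model_dir.
resolver = tf.contrib.cluster_resolver.TPUClusterResolver(tpu='demo-tpu')
config = tf.contrib.tpu.RunConfig(
    cluster=resolver,
    model_dir='gs://demo-bucket/checkpoints',
    save_checkpoints_steps=500,  # bound the work lost to a preemption
    tpu_config=tf.contrib.tpu.TPUConfig(iterations_per_loop=100))
```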

#4 Integrated with GCP

Cloud TPUs and Google Cloud’s data and analytics services are fully integrated with other GCP offerings, giving developers unified access across the entire service line. Developers can run machine learning workloads on Cloud TPUs while drawing on Google Cloud Platform’s storage, networking, and data analytics technologies.
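For instance, a training input function can stream TFRecords straight out of a Cloud Storage bucket, with no separate copy step or custom storage system. The bucket name and record schema below are hypothetical:

```python
import tensorflow as tf

def parse_example(serialized):
    # Hypothetical TFRecord schema: a flat float feature vector and an int label.
    parsed = tf.parse_single_example(serialized, {
        'features': tf.FixedLenFeature([64], tf.float32),
        'label': tf.FixedLenFeature([], tf.int64),
    })
    return parsed['features'], parsed['label']

def input_fn(params):
    # Stream training data directly from a (hypothetical) Cloud Storage bucket.
    files = tf.data.Dataset.list_files('gs://demo-bucket/train-*.tfrecord')
    dataset = files.interleave(tf.data.TFRecordDataset, cycle_length=8)
    dataset = dataset.map(parse_example, num_parallel_calls=8)
    dataset = dataset.repeat().batch(params['batch_size'], drop_remainder=True)
    return dataset.prefetch(2)  # keep the TPU fed while the host prepares batches
```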

#5 Additional features

Cloud TPUs excel at synchronous training. The Cloud TPU software stack transparently distributes ML models across multiple TPU devices in a Cloud TPU Pod, helping customers achieve scalability. All Cloud TPUs are integrated with Google Cloud’s high-speed storage systems, ensuring that data input pipelines can keep up with the TPUs. Users do not have to manage parameter servers, deal with complicated custom networking configurations, or set up exotic storage systems to achieve this level of training performance in the cloud.
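Concretely, the synchrony comes from wrapping the optimizer inside the model function, as in the sketch under feature #2. A minimal sketch of just that wrapping:

```python
import tensorflow as tf

# Wrapping a standard optimizer makes every training step a synchronous
# all-reduce: each core's gradients are averaged across the pod before
# weights are updated, with no parameter servers involved.
optimizer = tf.train.GradientDescentOptimizer(learning_rate=0.01)
optimizer = tf.contrib.tpu.CrossShardOptimizer(optimizer)
# Inside a TPUEstimator model_fn, optimizer.minimize(loss) then yields the
# train_op passed to tf.contrib.tpu.TPUEstimatorSpec.
```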

Performance and Cost benchmarking of Cloud TPU

Google compared Cloud TPU Pods against Google Cloud VMs with NVIDIA Tesla V100 GPUs attached, using TensorFlow 1.12 implementations of ResNet-50 v1.5 (GPU version, TPU version), one of the MLPerf reference models. They trained ResNet-50 on the ImageNet image classification dataset.

The results show that Cloud TPU Pods deliver near-linear speedups for this large-scale training task; the largest configuration tested (256 chips) delivers a 200x speedup over an individual V100 GPU. Check out Google’s methodology page for further details on the test. Training ResNet-50 on a full Cloud TPU v2 Pod costs almost 40% less than training the same model to the same accuracy on an n1-standard-64 Google Cloud VM with eight V100 GPUs attached, and the full pod completes the training task 27 times faster.
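The two headline numbers are related by simple arithmetic: a pod with a much higher hourly price can still cost less in total when it finishes 27 times sooner. A toy calculation, where the hourly rates are hypothetical placeholders and not Google’s published prices:

```python
# Hypothetical hourly rates, for illustration only (not real prices).
gpu_vm_rate = 20.0          # $/hr, n1-standard-64 VM with eight V100s (made up)
tpu_pod_rate = 325.0        # $/hr, full Cloud TPU v2 Pod (made up)

gpu_hours = 27.0            # suppose the GPU setup needs 27 hours...
tpu_hours = gpu_hours / 27  # ...then the pod, 27x faster, needs 1 hour

gpu_cost = gpu_vm_rate * gpu_hours   # $540
tpu_cost = tpu_pod_rate * tpu_hours  # $325
savings = 1 - tpu_cost / gpu_cost
print('Pod saves {:.0%} on this run'.format(savings))  # ~40% with these rates
```

With these made-up rates the pod comes out about 40% cheaper, which matches the shape (not the substance) of Google’s comparison; the real figure depends on actual prices and training times.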

Head over to Google Cloud’s official page to learn more about Cloud TPU Pods. Alternatively, check out the Cloud TPU documentation for more details.

Read Next

Intel Optane DC Persistent Memory available first on Google Cloud
Google Cloud Storage Security gets an upgrade with Bucket Lock, Cloud KMS keys and more
Oracle’s Thomas Kurian to replace Diane Greene as Google Cloud CEO; is this Google’s big enterprise Cloud market move?