Researchers from SenseTime Research and Nanyang Technological University have broken the record for training AlexNet on ImageNet, completing the job in 1.5 minutes. The previous record of four minutes was held by a model developed by researchers at Tencent, the Chinese tech giant, and Hong Kong Baptist University. The new result is a significant 2.6x speedup over the previous record.
The SenseTime and Nanyang team used a communication backend called “GradientFlow” along with a set of network optimization techniques to reduce the deep neural network (DNN) model training time. The researchers also proposed a technique called “lazy allreduce” to combine multiple communication operations into a single one.
The researchers say that high communication overhead is one of the major performance bottlenecks for distributed DNN training across multiple GPUs. To combat this issue, one of the techniques used was increasing the batch size, so that more samples are processed per iteration and the dataset is run through more quickly. They also used a mixture of half-precision floating point (FP16) and single-precision floating point (FP32). Both techniques reduce the memory bandwidth pressure on the GPUs used to accelerate the machine-learning math in hardware, but cause some loss of accuracy.
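The FP16/FP32 mixture typically follows the standard mixed-precision pattern: gradients are computed and communicated in FP16, while an FP32 "master" copy of the weights accumulates updates so small changes are not rounded away. The sketch below illustrates that general idea in NumPy; the function name and exact update rule are illustrative, not taken from the paper.

```python
import numpy as np

def mixed_precision_sgd_step(master_weights, grads_fp16, lr=0.01):
    """One SGD step in the common FP16/FP32 mixed-precision pattern:
    FP16 gradients are promoted to FP32 and applied to a full-precision
    master copy of the weights, which is then cast back down to FP16
    for the next forward pass."""
    update = lr * grads_fp16.astype(np.float32)
    master_weights -= update          # FP32 accumulation, in place
    return master_weights.astype(np.float16)

master = np.ones(4, dtype=np.float32)   # FP32 master weights
g = np.full(4, 0.5, dtype=np.float16)   # gradients arriving in FP16
w16 = mixed_precision_sgd_step(master, g, lr=0.1)
```

Keeping the master copy in FP32 is what preserves accuracy when an update like `0.1 * 0.5` would otherwise be too small to register against an FP16 weight.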
How does GradientFlow work?
GradientFlow is a software toolkit that tackles the high communication cost of distributed DNN training. It is a communication backend that slashed training times on GPUs, as described in the team's paper, published earlier this month. GradientFlow employs lazy allreduce to reduce network cost, improving network throughput by fusing multiple allreduce operations into a single one. It also employs "coarse-grained sparse communication" to reduce network traffic, sending only the important gradient chunks.
Every GPU stores a batch of data from ImageNet and uses gradient descent to crunch through its pixels. The resulting gradient values are passed on to server nodes in order to update the parameters of the overall model. This is done using a type of parallel-processing algorithm known as allreduce.
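Conceptually, an allreduce leaves every worker holding the same reduced (here, summed) result computed over all workers' tensors. The toy function below shows that semantics only; a real implementation (e.g. a ring or tree allreduce over NCCL) moves the data over the network.

```python
import numpy as np

def allreduce_sum(per_gpu_grads):
    """Toy allreduce: after the call, every worker holds the
    element-wise sum of all workers' gradient tensors. Real systems
    perform this reduction collectively over the network."""
    total = np.sum(per_gpu_grads, axis=0)
    return [total.copy() for _ in per_gpu_grads]

# Two "GPUs", each with its own local gradients:
grads = [np.array([1.0, 2.0]), np.array([3.0, 4.0])]
reduced = allreduce_sum(grads)   # both now hold [4.0, 6.0]
```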
Trying to ingest these values, or tensors, from hundreds of GPUs at a time results in bottlenecks. GradientFlow increases the efficiency of the code by allowing the GPUs to communicate and exchange gradients locally before final values are sent to the model. "Instead of immediately transmitting generated gradients with allreduce, GradientFlow tries to fuse multiple sequential communication operations into a single one, avoiding sending a huge number of small tensors via network," the researchers wrote.
Lazy allreduce fuses multiple allreduce operations into a single operation with minimal GPU memory copy overhead. On completing its backward computation, a layer with learnable parameters generates one or more gradient tensors. In the baseline system, every tensor is allocated a separate GPU memory space. With lazy allreduce, all gradient tensors are instead placed in a memory pool.
Lazy allreduce waits for the lower layers' gradient tensors until the total size of the waiting tensors exceeds a given threshold θ. Then a single allreduce operation is performed on all the waiting gradient tensors. This avoids transmitting small tensors over the network and improves network utilization.
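The buffering logic described above can be sketched as follows. This is a simplified illustration of the fusion idea, not the paper's implementation: tensors are accumulated until the buffered size reaches a threshold, then concatenated and handed to a single allreduce call.

```python
import numpy as np

def lazy_allreduce(gradient_tensors, threshold, allreduce_fn):
    """Sketch of lazy allreduce: buffer per-layer gradient tensors as
    backward passes complete, and launch one fused allreduce only once
    the buffered element count reaches the threshold (theta)."""
    buffer, buffered_elems, results = [], 0, []
    for t in gradient_tensors:
        buffer.append(t)
        buffered_elems += t.size
        if buffered_elems >= threshold:
            fused = np.concatenate([b.ravel() for b in buffer])
            results.append(allreduce_fn(fused))  # one network op, many tensors
            buffer, buffered_elems = [], 0
    if buffer:  # flush any remaining tail tensors
        fused = np.concatenate([b.ravel() for b in buffer])
        results.append(allreduce_fn(fused))
    return results

# Three per-layer gradient tensors; stand-in allreduce just doubles values.
tensors = [np.ones(3), np.ones(2), np.ones(4)]
calls = lazy_allreduce(tensors, threshold=4, allreduce_fn=lambda x: x * 2)
# Three tensors collapse into two fused allreduce calls (sizes 5 and 4).
```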
Coarse-grained sparse communication (CSC)
To further reduce network traffic with high bandwidth utilization, the researchers have proposed coarse-grained sparse communication to select important gradient chunks for allreduce.
The generated tensors are placed in a memory pool with a contiguous address space, based on the order in which they are generated. CSC partitions the gradient memory pool equally into chunks, each containing a number of gradients. In this research, each chunk contains 32K gradients, and CSC partitions the gradient memory pools of AlexNet and ResNet-50 into 1,903 and 797 chunks respectively. A percentage (e.g., 10%) of gradient chunks is selected as important chunks at the end of each iteration.
[Figure: Design of coarse-grained sparse communication (CSC)]
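The chunk-selection step can be sketched as below. Note the importance criterion here (L2 norm of each chunk) is an assumption for illustration; the paper's exact selection rule may differ, and the chunk size is scaled down from the 32K used in the research.

```python
import numpy as np

def select_important_chunks(grad_pool, chunk_size, top_fraction):
    """Sketch of CSC selection: partition a flat gradient memory pool
    into fixed-size chunks and keep only the top fraction of chunks,
    ranked here by L2 norm (an assumed importance metric)."""
    n_chunks = int(np.ceil(grad_pool.size / chunk_size))
    padded = np.zeros(n_chunks * chunk_size)
    padded[:grad_pool.size] = grad_pool
    chunks = padded.reshape(n_chunks, chunk_size)
    norms = np.linalg.norm(chunks, axis=1)
    k = max(1, int(n_chunks * top_fraction))
    important = np.argsort(norms)[-k:]   # indices of chunks to transmit
    return important, chunks[important]

# A toy pool of 3 chunks of 64 gradients; only the middle chunk is large.
pool = np.concatenate([np.zeros(64), np.full(64, 5.0), np.full(64, 0.1)])
idx, sel = select_important_chunks(pool, chunk_size=64, top_fraction=0.34)
```

Transmitting only the selected chunk indices and their contents is what cuts network traffic, at the cost of delaying the updates carried by the unselected chunks.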
GradientFlow improves network performance for distributed DNN training. When training AlexNet on ImageNet across 512 GPUs, the researchers achieved a speedup ratio of up to 410.2 and completed 95-epoch training in 1.5 minutes, outperforming existing approaches.
You can head over to the research paper for a more in-depth performance analysis of the model proposed.