
Yesterday (on the 7th of November), Facebook open-sourced its high-performance kernel library FBGEMM: Facebook GEneral Matrix Multiplication. The library offers optimized on-CPU performance for the reduced-precision calculations used to accelerate deep learning models, and it has delivered 2x performance gains when deployed at Facebook, compared with their current production baseline.

Users can deploy it using the Caffe2 front end, and it will soon be callable directly from the PyTorch 1.0 Python front end.

Features of FBGEMM

1. FBGEMM is optimized for server-side inference. It delivers accuracy and efficiency when performing quantized inference using contemporary deep learning frameworks.
2. It is a low-precision, high-performance matrix-matrix multiplication and convolution library that enables large-scale production servers to run the most powerful deep learning models efficiently.
3. The library exploits opportunities to overcome the unique challenges of matrix multiplication at lower precision with bandwidth-bound pre- and post-GEMM operations.
4. At Facebook, FBGEMM has benefited many AI services: it increased the speed of English-to-Spanish translations by 1.3x, reduced DRAM bandwidth usage in the recommendation system used in feeds by 40%, and sped up character detection by 2.4x in Rosetta, the machine learning system for understanding text in images and videos.
5. FBGEMM supplies modular building blocks for constructing an overall GEMM pipeline by plugging in different front-end and back-end components. It combines small compute with bandwidth-bound operations and exploits cache locality by fusing post-GEMM operations with the macro kernel, while providing support for accuracy-loss-reducing operations (see the sketch after this list).
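The idea of fusing bandwidth-bound post-GEMM work with the compute kernel can be pictured with a small sketch. The snippet below is a minimal, unoptimized illustration of an int8 GEMM whose int32 accumulator is requantized in the same loop that produces it, rather than in a separate pass over memory. The function name, parameters, and scalar loop are illustrative assumptions for exposition, not FBGEMM's actual packing or kernel API.

```cpp
#include <cstdint>
#include <cmath>
#include <algorithm>

// Conceptual sketch of a low-precision GEMM with a fused post-GEMM step.
// A holds 8-bit quantized activations, B holds 8-bit quantized weights,
// accumulation happens in 32-bit integers, and requantization back to
// uint8 is applied while the accumulator is still hot in cache.
// Names and layout are illustrative, not FBGEMM's API.
void QuantizedGemmWithFusedRequant(
    int M, int N, int K,
    const uint8_t* A,        // M x K activations
    const int8_t* B,         // K x N weights
    int32_t A_zero_point,
    float requant_scale,     // combines input/weight/output scales
    int32_t C_zero_point,
    uint8_t* C) {            // M x N quantized output
  for (int m = 0; m < M; ++m) {
    for (int n = 0; n < N; ++n) {
      int32_t acc = 0;
      for (int k = 0; k < K; ++k) {
        acc += (static_cast<int32_t>(A[m * K + k]) - A_zero_point) *
               static_cast<int32_t>(B[k * N + n]);
      }
      // Fused post-GEMM operation: requantize the int32 accumulator to
      // uint8 immediately, instead of writing an int32 matrix to memory
      // and reading it back in a separate bandwidth-bound pass.
      int32_t q = static_cast<int32_t>(std::lround(acc * requant_scale)) +
                  C_zero_point;
      C[m * N + n] = static_cast<uint8_t>(std::min(255, std::max(0, q)));
    }
  }
}
```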

Why does GEMM matter?

Floating point operations (FLOPs) in the deep learning models deployed in Facebook's data centers are mostly consumed by fully connected (FC) operators. These FC operators are just plain GEMM, which means that their overall efficiency directly depends on GEMM efficiency. Convolutions, which account for a further 19% of these FLOPs at Facebook, are commonly implemented in deep learning frameworks as im2col followed by GEMM. However, straightforward im2col adds overhead from copying and replicating input data, so some deep learning libraries implement direct (im2col-free) convolution for improved efficiency. FBGEMM instead provides a way to fuse im2col with the main GEMM kernel to minimize the im2col overhead.
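To see where the im2col overhead comes from, here is a minimal single-channel, stride-1, no-padding im2col sketch (the names and layout are illustrative assumptions, not FBGEMM code). Every input pixel is copied into as many rows as there are kernel positions covering it, which is exactly the replication cost that fusing im2col with the GEMM kernel avoids; the resulting matrix can then be multiplied by a flattened weight matrix with an ordinary GEMM.

```cpp
#include <vector>

// Minimal im2col for one input channel, stride 1, no padding.
// Produces an (out_h * out_w) x (kh * kw) matrix in row-major order,
// one row per output pixel, each row holding a copy of the kh x kw
// input patch that the convolution kernel would read at that position.
std::vector<float> Im2Col(const std::vector<float>& input,
                          int height, int width, int kh, int kw) {
  const int out_h = height - kh + 1;
  const int out_w = width - kw + 1;
  std::vector<float> cols(static_cast<size_t>(out_h) * out_w * kh * kw);
  size_t idx = 0;
  for (int oh = 0; oh < out_h; ++oh) {
    for (int ow = 0; ow < out_w; ++ow) {
      for (int i = 0; i < kh; ++i) {
        for (int j = 0; j < kw; ++j) {
          // Input data is replicated: neighbouring rows share pixels.
          cols[idx++] = input[(oh + i) * width + (ow + j)];
        }
      }
    }
  }
  return cols;
}
```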

Facebook says that recent industry and research work has shown that inference using mixed precision works well without adversely affecting accuracy, and FBGEMM uses this as an alternative strategy to improve inference performance with quantized models. Newer generations of GPUs, CPUs, and specialized tensor processors also natively support lower-precision compute primitives, so the deep learning community is moving toward low-precision models. FBGEMM provides a way to perform efficient quantized inference on current and upcoming generations of CPUs.
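As a rough picture of what quantized inference means here, the snippet below shows the common affine mapping between floating-point values and 8-bit integers: q = round(x / scale) + zero_point, recovered approximately as x = (q - zero_point) * scale. It is a generic illustration of the scheme, assumed for exposition, rather than code taken from FBGEMM.

```cpp
#include <cstdint>
#include <cmath>
#include <algorithm>

// Affine (asymmetric) quantization as commonly used for int8 inference.
// The scale and zero_point are chosen per tensor (or per channel) so that
// the float range of interest maps onto [0, 255]. Names are illustrative.
uint8_t Quantize(float x, float scale, int32_t zero_point) {
  int32_t q = static_cast<int32_t>(std::lround(x / scale)) + zero_point;
  return static_cast<uint8_t>(std::min(255, std::max(0, q)));
}

float Dequantize(uint8_t q, float scale, int32_t zero_point) {
  // Recovers an approximation of the original float value.
  return (static_cast<int32_t>(q) - zero_point) * scale;
}
```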

Head over to Facebook’s official blog to learn more about this library and how it is implemented.

Read Next

A new data breach on Facebook due to malicious browser extensions allowed almost 81,000 users’ private data up for sale, reports BBC News

90% Google Play apps contain third-party trackers, share user data with Alphabet, Facebook, Twitter, etc: Oxford University Study

Facebook open sources a set of Linux kernel products including BPF, Btrfs, Cgroup2, and others to address production issues