
Just two days ago, the research team at OpenAI introduced the Sparse Transformer, a deep neural network that sets new records at predicting what comes next in a sequence, be it text, images, or sound. It uses an algorithmic improvement of the attention mechanism to extract patterns from sequences 30 times longer than was previously possible.

The Sparse Transformer incorporates an O(N \sqrt{N}) reformulation of the O(N^2) Transformer self-attention mechanism, along with several other improvements, so that it can be applied directly to rich data types such as text, images, and sound. Previously, the models used on these data were designed specifically for one domain, or were difficult to scale to sequences more than a few thousand elements long.
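To get a feel for what the change from O(N^2) to O(N \sqrt{N}) means, here is a rough back-of-the-envelope sketch. It only counts position pairs and ignores constant factors; the function name and the sequence lengths are illustrative assumptions, not figures from the research.

```python
import math

def attention_pairs(n):
    """Return (dense, sparse) counts of attended position pairs for length n.

    dense  ~ n^2          (every position attends to every position)
    sparse ~ n * sqrt(n)  (each position attends to about sqrt(n) positions)
    Constant factors are ignored; this is an order-of-magnitude sketch only.
    """
    dense = n * n
    sparse = n * math.isqrt(n)
    return dense, sparse

# Illustrative lengths (perfect squares chosen to keep the arithmetic exact).
for n in (1024, 4096, 16384):
    dense, sparse = attention_pairs(n)
    print(f"n={n:>6}  dense={dense:>12,}  sparse={sparse:>10,}  ratio={dense // sparse}x")
```

For these three lengths the ratio works out to 32x, 64x, and 128x: the saving is a factor of about \sqrt{N}, so it grows as sequences get longer.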

The new Sparse Transformer can model sequences with tens of thousands of elements using hundreds of layers, achieving state-of-the-art performance across multiple domains. With this technique, the researchers aim to build AI systems that possess a greater ability to understand the world.

The team also introduced several other changes to the Transformer, including a restructured residual block and a new weight initialization, to improve the training of very deep networks. In addition, they introduced a set of sparse attention kernels that efficiently compute subsets of the attention matrix, and they experimented with recomputing attention weights during the backward pass to reduce memory usage.

Initial Experimentation with Deep Attention

In Transformers, ‘attention’ is a process in which every output element is connected to every input element, and the weightings between them are calculated dynamically based on the circumstances. This makes Transformers more flexible than models with fixed connectivity patterns. However, it also means they can consume large amounts of memory when applied to data types with many elements, like images or raw audio.
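The idea of dynamically calculated weightings can be illustrated with a minimal sketch of dense attention. It assumes the standard scaled dot-product formulation with query, key, and value vectors, which the article does not spell out; the function names are illustrative.

```python
import math

def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dense_attention(queries, keys, values):
    """Scaled dot-product attention over plain Python lists of vectors.

    Every output element looks at every input element, so the number of
    weights computed is len(queries) * len(keys).
    """
    d = len(keys[0])
    outputs = []
    for q in queries:
        # One score per input element, computed from the data itself.
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in keys]
        weights = softmax(scores)
        # Output is the weighted average of the value vectors.
        out = [sum(w * v[c] for w, v in zip(weights, values))
               for c in range(len(values[0]))]
        outputs.append(out)
    return outputs

# Two inputs with identical keys receive equal weight, so the output
# is the plain average of their values.
print(dense_attention([[1.0, 0.0]], [[0.0, 0.0], [0.0, 0.0]], [[1.0], [3.0]]))
```

Because the weights depend on the inputs, nothing is fixed in advance, but the full grid of weights is what becomes expensive for long sequences.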

One way of reducing this memory consumption is to recompute the attention matrix from checkpoints during backpropagation, a well-established technique in deep learning that reduces memory usage at the cost of more computation. On its own, however, recomputation still cannot cope with very large inputs.
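The trade-off behind recomputation can be shown with a toy sketch. This is generic checkpointing on a chain of scalar tanh layers, not the researchers' attention-specific implementation; the layer function, the checkpoint spacing, and all names are assumptions made for illustration.

```python
import math

def layer(x, w):
    return math.tanh(w * x)

def layer_grad(x, w):
    # d/dx tanh(w * x) = w * (1 - tanh(w * x)^2)
    return w * (1.0 - math.tanh(w * x) ** 2)

def grad_store_all(x, weights):
    """Backprop through the chain, keeping every layer input in memory."""
    inputs = []
    for w in weights:
        inputs.append(x)
        x = layer(x, w)
    g = 1.0
    for x_in, w in zip(reversed(inputs), reversed(weights)):
        g *= layer_grad(x_in, w)
    return g, len(inputs)

def grad_checkpointed(x, weights, every):
    """Same gradient, but only every `every`-th layer input is stored.

    The missing inputs are recomputed one segment at a time during the
    backward pass: less memory, more computation.
    """
    checkpoints = {}
    for i, w in enumerate(weights):
        if i % every == 0:
            checkpoints[i] = x
        x = layer(x, w)
    g = 1.0
    peak = len(checkpoints)
    for start in sorted(checkpoints, reverse=True):
        segment = weights[start:start + every]
        xs = [checkpoints[start]]            # recompute inputs in this segment
        for w in segment[:-1]:
            xs.append(layer(xs[-1], w))
        # Stored checkpoints plus the temporarily recomputed inputs.
        peak = max(peak, len(checkpoints) + len(xs) - 1)
        for x_in, w in zip(reversed(xs), reversed(segment)):
            g *= layer_grad(x_in, w)
    return g, peak

weights = [0.9, 1.1, 0.7, 1.3, 0.8, 1.2, 1.0, 0.6, 1.4]
g_full, mem_full = grad_store_all(0.5, weights)
g_ckpt, mem_ckpt = grad_checkpointed(0.5, weights, every=3)
print(abs(g_full - g_ckpt) < 1e-12, mem_full, mem_ckpt)
```

Both functions return the same gradient, while the checkpointed version holds fewer values at any one time and pays for it by running parts of the forward pass twice.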

To overcome this, the OpenAI researchers introduced Sparse Attention.

Using Sparse Attention patterns for large inputs

For very large inputs, computing even a single attention matrix can become impractical. The OpenAI researchers instead opted for sparse attention patterns, where each output position computes weightings from only a subset of input positions. To choose these patterns, the researchers first visualized the learned attention patterns of deep Transformers on images and found that many showed interpretable and structured sparsity. They also observed that each position typically focuses on a small subset of the input, and that these subsets show a high degree of regularity.

The researchers also implemented a two-dimensional factorization of the attention matrix, in which the network can attend to all positions through two steps of sparse attention. This preserves the network's ability to learn new patterns. They describe two versions of this factorization.

The first version is strided attention, which is roughly equivalent to each position attending to its row and its column, and is similar to the attention patterns that the network learned on images.
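A minimal sketch of a strided pattern follows, assuming the sequence is a flattened grid whose width equals the stride and that each position may only look at itself and earlier positions. The exact index sets and the function names are one plausible formalization chosen for illustration, not definitions quoted from the researchers.

```python
def strided_pattern(n, stride):
    """For each output position i, list the input positions it attends to.

    "row" head:    position i and the `stride` positions just before it.
    "column" head: every position j <= i with (i - j) % stride == 0.
    """
    row = [set(range(max(0, i - stride), i + 1)) for i in range(n)]
    col = [{j for j in range(i + 1) if (i - j) % stride == 0} for i in range(n)]
    return row, col

def reachable_in_two_steps(i, row, col):
    """Positions reachable from i by one column step followed by one row step."""
    return {j for k in col[i] for j in row[k]}

n, stride = 16, 4                      # a 4x4 grid flattened to 16 positions
row, col = strided_pattern(n, stride)
print(sorted(row[10]))                 # [6, 7, 8, 9, 10]
print(sorted(col[10]))                 # [2, 6, 10]
print(sorted(reachable_in_two_steps(10, row, col)))  # every position 0..10
```

Each position attends to only a handful of others, yet two steps of sparse attention are enough to reach every earlier position, which is the point of the factorization.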

The second version is fixed attention, which attends to a fixed column and to the elements after the latest column element. According to the researchers, this pattern is useful when the data doesn't fit into a two-dimensional structure, as is the case for text.
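The fixed pattern can be sketched in the same spirit. The block size, the number of "summary" positions per block, and the function name are assumptions made for this example; only positions at or before the current one are attended to.

```python
def fixed_pattern(n, block, summary=1):
    """For each output position i, list the input positions it attends to.

    "block" head:   positions j <= i in the same block as i, i.e. the
                    elements after the latest column element (plus i itself).
    "summary" head: the fixed column, i.e. the last `summary` positions of
                    every block, restricted to j <= i.
    """
    block_head = [{j for j in range(i + 1) if j // block == i // block}
                  for i in range(n)]
    summary_head = [{j for j in range(i + 1) if j % block >= block - summary}
                    for i in range(n)]
    return block_head, summary_head

n, block = 16, 4
block_head, summary_head = fixed_pattern(n, block)
print(sorted(block_head[10]))    # [8, 9, 10]
print(sorted(summary_head[10]))  # [3, 7]
```

Here position 10 sees its own block directly and reaches earlier blocks through their last positions (3 and 7), which act as summaries of everything before them.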

Testing Sparse Transformers on density modeling tasks

The researchers tested their architecture on density modeling tasks covering natural images, text, and raw audio, using datasets that include CIFAR-10, EnWik8, and ImageNet 64. The team trained strided Sparse Transformers on CIFAR-10 images represented as sequences of 3072 bytes (32 x 32 pixels with 3 color channels). They also trained models on the EnWik8 dataset, which represents the first 10^8 bytes of Wikipedia and contains a great deal of variability in its periodic structure. They further trained on a downsampled version of ImageNet, ImageNet 64.

The researchers found that sparse attention achieved lower loss than full attention while also being faster.

Future scope and limitations

According to the researchers, these sparse attention patterns are only preliminary steps toward efficient modeling of long sequences. They think that exploring different patterns and combinations of sparsity is useful, and that learning sparse patterns is a promising avenue of research for the next generation of neural network architectures.

According to them, autoregressive sequence generation still seems impractical for very high-resolution images or video. The optimized attention operations may nevertheless prove useful when combined with other approaches to modeling high-dimensional data, such as multi-scale approaches.

This is just an overview of the Sparse Transformer architecture. For more detailed information, we recommend reading the research paper.
