Amazon Alexa AI researchers develop a new method to compress neural networks while preserving system accuracy


At the 33rd conference of the Association for the Advancement of Artificial Intelligence (AAAI), Amazon Alexa researchers, in collaboration with researchers from the University of Texas, will present a paper describing a new method for compressing neural networks, which in turn improves the performance of the network.

Yesterday, on the Amazon Blog, Anish Acharya and Rahul Goel, both applied scientists at Amazon Alexa AI, explained how huge neural networks tend to slow a system down. Their paper, “Online Embedding Compression for Text Classification using Low Rank Matrix Factorization”, describes a method for compressing embedding tables, whose size often compromises the performance of the NLU network and thus slows down AI-based systems like Alexa. Compressing these tables will help Alexa perform increasingly complex tasks in milliseconds.

The researchers covered the following topics within the paper:

A compression method for deep NLP models that reduces the memory footprint through low-rank matrix factorization of the embedding layer, with further fine-tuning to recover the accuracy lost to compression.
A demonstration that their method outperforms baselines such as fixed-point quantization and offline embedding compression on sentence classification.
An analysis of inference time for their method.
CALR, a novel learning rate scheduling algorithm for gradient-descent-based optimization, which they show outperforms other popular adaptive learning rate algorithms on sentence classification.
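To make the first item concrete, here is a minimal sketch of a low-rank factorization of an embedding table, using truncated SVD from NumPy. The vocabulary size, embedding dimension, and rank are hypothetical values chosen for illustration, and the random matrix only stands in for real pre-trained embeddings; this is not the researchers' code.

```python
import numpy as np


def low_rank_factorize(embeddings, rank):
    """Split a (vocab, dim) matrix into (vocab, rank) and (rank, dim) factors."""
    # Thin SVD: embeddings = u @ diag(s) @ vt, singular values sorted descending.
    u, s, vt = np.linalg.svd(embeddings, full_matrices=False)
    a = u[:, :rank] * s[:rank]  # keep the top `rank` directions, scaled
    b = vt[:rank, :]
    return a, b


# Hypothetical sizes, not taken from the paper.
vocab_size, dim, rank = 2000, 300, 30

rng = np.random.default_rng(0)
embeddings = rng.standard_normal((vocab_size, dim)).astype(np.float32)

a, b = low_rank_factorize(embeddings, rank)

original_params = embeddings.size    # 2000 * 300 = 600,000
compressed_params = a.size + b.size  # 2000 * 30 + 30 * 300 = 69,000
reduction = 1 - compressed_params / original_params

# Relative error of the rank-30 approximation a @ b.
rel_error = np.linalg.norm(embeddings - a @ b) / np.linalg.norm(embeddings)

print(f"parameters: {original_params} -> {compressed_params}")
print(f"reduction: {reduction:.1%}, relative error: {rel_error:.3f}")
```

With these sizes the two factors hold roughly a ninth of the original parameters. Random data compresses poorly, so the reconstruction error here is large; real embeddings have more structure, and in the approach described in the article the remaining loss is addressed by continuing to train the two factors inside the network.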

Steps taken to obtain optimal performance of the network

The blog briefly lists the steps the researchers took to compress the neural network:

The experiment used a set of pre-trained word embeddings called ‘GloVe’. GloVe infers words’ meanings from their co-occurrence with other words in huge bodies of training data.
The team started with a model initialized with a large embedding space, performed a low-rank projection of the embedding layer using Singular Value Decomposition (SVD), and continued training to regain any lost accuracy.
The aim was to integrate the embedding table into the neural network so that task-specific training data could be used not only to fine-tune the embeddings but also to customize the compression scheme.
SVD was used to reduce the embeddings’ dimensionality. It broke the initial embedding matrix down into two smaller matrices, cutting the number of parameters by almost 90%.
One of these matrices serves as one layer of a neural network and the second as the layer above it. Between the layers are connections with associated “weights”, which the training process can readjust. These weights determine how much influence the outputs of the lower layer have on the computations performed by the higher one.
The paper also describes a new procedure for selecting the network’s “learning rate”. The researchers modify the ‘cyclical learning rate’ procedure to help training escape local minima. The resulting technique, called the cyclically annealed learning rate (CALR), gives better performance than either the cyclical learning rate or a fixed learning rate.
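The idea behind the last step can be illustrated with a small schedule function. This is only a sketch of one way to anneal a cyclical learning rate, assuming a triangular cycle whose peak shrinks by a constant factor after every cycle; the paper's exact schedule may differ, and the function name, parameter names, and default values are all hypothetical.

```python
def cyclically_annealed_lr(step, base_lr=1e-4, max_lr=1e-2,
                           half_cycle=100, decay=0.5):
    """Triangular cyclical learning rate whose peak decays every cycle.

    All parameters are illustrative assumptions, not values from the paper.
    """
    # Which cycle we are in, and how far into it.
    cycle, pos = divmod(step, 2 * half_cycle)
    # Anneal the peak: each completed cycle scales the amplitude by `decay`.
    peak = base_lr + (max_lr - base_lr) * decay ** cycle
    # Triangular wave: rises from 0 to 1 over half a cycle, then falls back.
    frac = 1 - abs(pos / half_cycle - 1)
    return base_lr + (peak - base_lr) * frac


# Learning rate at the start, at the first peak, and at the second peak.
for step in (0, 100, 200, 300):
    print(step, cyclically_annealed_lr(step))
```

The periodic rise back toward a higher rate is what gives the optimizer a chance to jump out of a poor local minimum, while the shrinking peak lets training settle as it progresses.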

Results and conclusion

The system developed by the researchers could shrink a neural network by 90 percent for both LSTM and DAN models while reducing its accuracy by less than 1%. They compared their approach with two alternatives: one in which the embedding table is compressed before network training begins, and simple quantization, in which all of the values in the embedding vector are rounded to a limited number of reference values. Testing across a range of compression rates, on different types of neural networks and different data sets, they found that their system outperformed both alternatives.
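For readers unfamiliar with the quantization baseline, the following sketch shows what rounding values to a limited number of reference values can look like. It assumes evenly spaced reference values between the smallest and largest entry, which is one simple choice and not necessarily the scheme used in the paper's experiments; the function name and sample vector are hypothetical.

```python
def quantize(values, num_levels):
    """Round each value to the nearest of `num_levels` evenly spaced levels."""
    if num_levels < 2:
        raise ValueError("num_levels must be at least 2")
    lo, hi = min(values), max(values)
    if hi == lo:
        return list(values)  # constant vector: nothing to round
    step = (hi - lo) / (num_levels - 1)
    return [lo + round((v - lo) / step) * step for v in values]


# Hypothetical embedding vector, rounded to 5 reference values.
vector = [0.0, 0.1, 0.26, 0.74, 1.0]
print(quantize(vector, 5))  # [0.0, 0.0, 0.25, 0.75, 1.0]
```

Each stored value then needs only enough bits to identify its reference level, which saves memory but, unlike the factorization approach, leaves no trainable structure with which to recover lost accuracy.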

You can read the research paper for more details on the experiments and results.