Machine learning has always been dependent on the selection of the right features within a given model; even the selection of the right algorithm. But deep learning changed this. The selection process is now built into the models themselves. Researchers and engineers are now shofting their focus from feature engineering to network engineering. Out of this, AutoML, or meta learning, has become an increasingly important part of deep learning.
AutoML is an emerging research topic which aims at auto-selecting the most efficient neural network for a given learning task. In other words, AutoML represents a set of methodologies for learning how to learn efficiently. Consider for instance the tasks of machine translation, image recognition, or game playing. Typically, the models are manually designed by a team of engineers, data scientist, and domain experts. If you consider that a typical 10-layer network can have ~1010 candidate network, you understand how expensive, error prone, and ultimately sub-optimal the process can be.
This article is an excerpt from a book written by Antonio Gulli and Amita Kapoor titled TensorFlow 1.x Deep Learning Cookbook. This book is an easy-to-follow guide that lets you explore reinforcement learning, GANs, autoencoders, multilayer perceptrons and more.
AutoML with recurrent networks and with reinforcement learning
The key idea to tackle this problem is to have a controller network which proposes a child model architecture with probability p, given a particular network given in input. The child is trained and evaluated for the particular task to be solved (say for instance that the child gets accuracy R). This evaluation R is passed back to the controller which, in turn, uses R to improve the next candidate architecture. Given this framework, it is possible to model the feedback from the candidate child to the controller as the task of computing the gradient of p and then scale this gradient by R. The controller can be implemented as a Recurrent Neural Network (see the following figure). In doing so, the controller will tend to privilege iteration after iterations candidate areas of architecture that achieve better R and will tend to assign a lower probability to candidate areas that do not score so well.
For instance, a controller recurrent neural network can sample a convolutional network. The controller can predict many hyper-parameters such as filter height, filter width, stride height, stride width, and the number of filters for one layer and then can repeat. Every prediction can be carried out by a softmax classifier and then fed into the next RNN time step as input. This is well expressed by the following images taken from Neural Architecture Search with Reinforcement Learning, Barret Zoph, Quoc V. Le:
Predicting hyperparameters is not enough as it would be optimal to define a set of actions to create new layers in the network. This is particularly difficult because the reward function that describes the new layers is most likely not differentiable. This makes it impossible to optimize using standard techniques such as SGD. The solution comes from reinforcement learning. It consists of adopting a policy gradient network.
Besides that, parallelism can be used for optimizing the parameters of the controller RNN. Quoc Le & Barret Zoph proposed to adopt a parameter-server scheme where we have a parameter server of S shards, that store the shared parameters for K controller replicas. Each controller replica samples m different child architectures that are trained in parallel as illustrated in the following images, taken from Neural Architecture Search with Reinforcement Learning, Barret Zoph, Quoc V. Le:
Quoc and Barret applied AutoML techniques for Neural Architecture Search to the Penn Treebank dataset, a well-known benchmark for language modeling. Their results improve the manually designed networks currently considered the state-of-the-art. In particular, they achieve a test set perplexity of 62.4 on the Penn Treebank, which is 3.6 perplexity better than the previous state-of-the-art model. Similarly, on the CIFAR-10 dataset, starting from scratch, the method can design a novel network architecture that rivals the best human-invented architecture in terms of test set accuracy. The proposed CIFAR-10 model achieves a test error rate of 3.65, which is 0.09 percent better and 1.05x faster than the previous state-of-the-art model that used a similar architectural scheme.
In Learning Transferable Architectures for Scalable Image Recognition, Barret Zoph, Vijay Vasudevan, Jonathon Shlens, Quoc V. Le, 2017. propose to learn an architectural building block on a small dataset that can be transferred to a large dataset. The authors propose to search for the best convolutional layer (or cell) on the CIFAR-10 dataset and then apply this learned cell to the ImageNet dataset by stacking together more copies of this cell, each with their own parameters. Precisely, all convolutional networks are made of convolutional layers (or cells) with identical structures but different weights. Searching for the best convolutional architectures is therefore reduced to searching for the best cell structures, which is faster more likely to generalize to other problems.
Although the cell is not learned directly on ImageNet, an architecture constructed from the best learned cell achieves, among the published work, state-of-the-art accuracy of 82.7 percent top-1 and 96.2 percent top-5 on ImageNet. The model is 1.2 percent better in top-1 accuracy than the best human-invented architectures while having 9 billion fewer FLOPS—a reduction of 28% from the previous state of the art model. What is also important to notice is that the model learned with RNN+RL (Recurrent Neural Networks + Reinforcement Learning) is beating the baseline represented by Random Search (RS) as shown in the figure taken from the paper. In the mean performance of the top-5 and top-25 models identified in RL versus RS, RL is always winning:
AutoML and learning new tasks
Meta-learning systems can be trained to achieve a large number of tasks and are then tested for their ability to learn new tasks. A famous example of this kind of meta-learning is transfer learning, where networks can successfully learn new image-based tasks from relatively small datasets. However, there is no analogous pre-training scheme for non-vision domains such as speech, language, and text.
Model-Agnostic Meta-Learning for Fast Adaptation of Deep Networks, Chelsea Finn, Pieter Abbeel, Sergey Levine, 2017, proposes a model- agnostic approach names MAML, compatible with any model trained with gradient descent and applicable to a variety of different learning problems, including classification, regression, and reinforcement learning. The goal of meta-learning is to train a model on a variety of learning tasks, such that it can solve new learning tasks using only a small number of training samples.
The meta-learner aims at finding an initialization that rapidly adapts to various problems quickly (in a small number of steps) and efficiently (using only a few examples). A model represented by a parametrized function fθ with parameters θ.When adapting to a new task Ti, the model’s parameters θ become θi . In MAML, the updated parameter vector θi is computed using one or more gradient descent updates on task Ti. For example, when using one gradient update, θ ~ = θ − α∇θLTi (fθ) where LTi is the loss function for the task T and α is a meta-learning parameter. The MAML algorithm is reported in this figure:
MAML was able to substantially outperform a number of existing approaches on popular few-shot image classification benchmark. Few shot image is a quite challenging problem aiming at learning new concepts from one or a few instances of that concept. As an example, Human-level concept learning through probabilistic program induction, Brenden M. Lake, Ruslan Salakhutdinov, Joshua B. Tenenbaum, 2015, suggested that humans can learn to identify novel two-wheel vehicles from a single picture such as the one contained in the box as follows: