On the second day of the NeurIPS conference held in Montreal, Canada last year, Dr. Joelle Pineau presented a talk on reproducibility in reinforcement learning. Dr. Pineau is an Associate Professor at McGill University and a Research Scientist at Facebook, Montreal, and the talk was titled ‘Reproducible, Reusable, and Robust Reinforcement Learning’.
Reproducibility and the crisis
Dr. Pineau begins with a quote from Bollen et al. in a National Science Foundation report: “Reproducibility refers to the ability of a researcher to duplicate the results of a prior study, using the same materials as were used by the original investigator. Reproducibility is a minimum necessary condition for a finding to be believable and informative.”
Reproducibility is not a new concern and spans many fields. In a 2016 Nature survey of 1,576 scientists, 52% said there is a significant reproducibility crisis, and another 38% acknowledged a slight crisis.
Reinforcement learning is a very general framework for decision making. About 20,000 papers were published in this area in 2018 alone, and the year is not even over yet, compared to roughly 2,000 papers in 2000. The talk focuses on the class of reinforcement learning methods that has received the most attention and shown the most promise for practical applications: policy gradients. The idea is to learn the policy (the strategy) directly as a function, and that function can be represented by a neural network.
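To make the idea concrete, here is a minimal sketch of a policy gradient (REINFORCE-style) update on a toy two-armed bandit. The problem, the learning rate, and the step count are all illustrative assumptions, not anything from the talk; a plain softmax over two logits stands in for the neural-network policy.

```python
import math
import random

def softmax(logits):
    """Convert logits into action probabilities."""
    exps = [math.exp(l - max(logits)) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]

def train_bandit_policy(steps=500, lr=0.1, seed=0):
    """REINFORCE on a 2-armed bandit: arm 0 pays reward 1, arm 1 pays 0."""
    rng = random.Random(seed)
    theta = [0.0, 0.0]          # policy parameters (logits)
    rewards = [1.0, 0.0]        # hypothetical reward for each arm
    for _ in range(steps):
        probs = softmax(theta)
        action = 0 if rng.random() < probs[0] else 1
        r = rewards[action]
        # REINFORCE update: theta += lr * r * grad log pi(action)
        for i in range(2):
            grad = (1.0 if i == action else 0.0) - probs[i]
            theta[i] += lr * r * grad
    return softmax(theta)

probs = train_bandit_policy()
```

After training, the probability mass shifts toward the rewarding arm; real policy gradient methods apply the same log-probability-weighted update through a neural network over trajectories.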
Pineau picks four research papers on policy gradient methods that appear most often in the literature. Her team used the MuJoCo simulator to compare the four algorithms; the point is not which algorithm is which, but the approach of comparing them empirically. The results differed across environments (Hopper, Swimmer), and the variance of a single algorithm also differed drastically between them. Even with different code and policies, the results for a given algorithm varied widely across environments.
It was observed that paper authors are not always motivated to find the best possible hyperparameters and very often use the defaults. When the best possible hyperparameters were used and two algorithms were compared fairly, the results were clean and distinguishable. Here n=5, i.e., five different random seeds. The choice of n determines the width of the confidence interval (CI); n=5 was used because most papers report at most five trials.
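The confidence-interval computation behind such comparisons can be sketched as follows. The five returns are hypothetical numbers, not results from the talk; the critical value 2.776 is the two-sided 95% t value for n=5 (4 degrees of freedom), and the CI width shrinks as n grows because of the sqrt(n) in the denominator.

```python
import math
import statistics

def confidence_interval(returns, t_crit=2.776):
    """95% CI for the mean of n seed returns (t_crit is for n=5, df=4)."""
    n = len(returns)
    mean = statistics.mean(returns)
    sem = statistics.stdev(returns) / math.sqrt(n)  # standard error of the mean
    return mean - t_crit * sem, mean + t_crit * sem

# hypothetical final returns from 5 random seeds of one algorithm
seed_returns = [3120.0, 2890.0, 3345.0, 2710.0, 3015.0]
lo, hi = confidence_interval(seed_returns)
```

Two algorithms are cleanly distinguishable in the sense of the talk when intervals like `(lo, hi)` do not overlap.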
Some papers also run n trials, where n is unspecified, and report only the top 5 results. This is a good way to show impressive numbers, but it introduces a strong positive bias and makes the variance appear smaller than it is.
Source: NeurIPS website
Some people argue that the field of reinforcement learning is broken. Pineau stresses that this is not her message, noting that fair comparisons don't always give the cleanest results. Different methods can have very different sets of hyperparameters in number, value, and sensitivity. Most importantly, the best method to choose depends heavily on your data and the computation budget you can spare. This is an important point for achieving reproducibility when applying these algorithms to your own problem.
Pineau and her team surveyed 50 RL papers from 2018 and found that significance testing was applied in only 5% of them. Many papers show graphs with shaded regions, but without stating what the shading represents, the reader cannot tell whether it is a confidence interval or a standard deviation. Pineau says: “Shading is good but shading is not knowledge unless you define it properly.”
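The two common choices of shading give genuinely different bands, which is why defining them matters. Below, both are computed from the same five hypothetical per-seed returns (illustrative numbers, not data from the talk); 2.776 is the 95% t critical value for df=4.

```python
import math
import statistics

# hypothetical per-seed returns at one training checkpoint (n = 5 seeds)
returns = [212.0, 187.0, 254.0, 175.0, 230.0]

mean = statistics.mean(returns)
sd = statistics.stdev(returns)               # spread of individual runs
ci95 = 2.776 * sd / math.sqrt(len(returns))  # 95% CI half-width of the mean

sd_band = (mean - sd, mean + sd)             # "mean +/- std" shading
ci_band = (mean - ci95, mean + ci95)         # "95% CI" shading
```

With n=5 the two bands have visibly different widths, so a plot shaded one way tells a different story from a plot shaded the other way.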
A reproducibility checklist
For people publishing papers, Pineau presents a checklist created in consultation with her colleagues. For algorithms, it asks for a clear description, an analysis of complexity, and a link to source code and dependencies.
For theoretical claims, it asks for a statement of the result, a clear explanation of any assumptions, and a complete proof. The checklist also covers figures and tables. Here is the complete checklist:
Source: NeurIPS website
Role of infrastructure on reproducibility
One might think that because the experiments run on computers, the results should be more predictable than in other sciences. But even hardware leaves room for variability, so specifying it can be useful; for example, the nondeterministic properties of some CUDA operations.
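A CPU-side analogue of that hardware variability, shown here as an illustration rather than anything from the talk: floating-point addition is not associative, so a parallel reduction (as on a GPU) that sums the same numbers in a different order can produce a different result. With 64-bit floats, 1e16 + 1 rounds back to 1e16, so the 1 survives or vanishes depending on the order.

```python
# The same three numbers summed in two orders: because 1e16 + 1.0
# rounds to 1e16 in double precision, the reduction order matters.
left_to_right = sum([1e16, 1.0, -1e16])   # (1e16 + 1.0) first -> the 1.0 is lost
reordered     = sum([1e16, -1e16, 1.0])   # cancel first     -> the 1.0 survives
```

Nondeterministic CUDA reductions vary the summation order between runs, which is one concrete reason to document the hardware and software stack.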
On some myths
“Reinforcement Learning is the only case of ML where it is acceptable to test on your training set.”
Do you have to train and test on the same task? Pineau argues that you really don't, presenting three examples.
- In the first, an agent moves in four directions over an image and then identifies what the image is; with a higher n, the variance is greatly reduced.
- The second is an Atari game in which the black background is replaced with videos, a source of noise that better represents the real world than a clean simulated environment where external real-world factors are absent.
- The third is multi-task RL in photorealistic simulators that incorporate noise. The simulator is an emulator built from images and videos taken from real homes, so the environments are completely photorealistic yet retain properties of the real world, such as mirror reflections.
Working in the real world is very different from working in a limited simulation. For one, far more data is required to represent the real world than a simulation.
The talk ends with the message that science is not a competitive sport but a collective institution that aims to understand and explain. Pineau also mentions the ICLR reproducibility challenge, which anyone can join: the goal is for community members to try to reproduce the empirical results presented in a paper, on an open-review basis. Last year, 80% of participating authors changed their paper based on the feedback from contributors who tested it.
Head over to the NeurIPS Facebook page for the entire lecture and other sessions from the conference.