OpenAI researchers have built a simple hide-and-seek game environment for multi-agent competition, in which they observed AI agents learning complex strategies and skills on their own as training progressed. The agents developed six distinct strategies and counterstrategies, some of which the researchers did not know the environment supported. The researchers concluded that such multi-agent co-adaptation may one day produce extremely complex and intelligent behavior.
The hide-and-seek training environment
AI agents play a team-based hide-and-seek game in a physics-based environment. Hiders (blue) try to stay out of the seekers' line of sight, and seekers try to keep the hiders in view. The environment contains objects (walls, ramps, blocks) that agents can grab and lock in place, as well as randomly generated immovable rooms and walls that the agents must learn to navigate. Before the game starts, the seekers are immobilized for a preparation phase, giving the hiders time to run away or change their environment.
Agents receive a team-based reward: hiders get +1 if all hiders are hidden and -1 if any hider is seen by a seeker. Seekers get the opposite reward: -1 if all hiders are hidden and +1 otherwise. There are no explicit incentives for agents to interact with objects in the environment; the only additional signal is a penalty for going too far outside the play area.
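The team-based reward described above can be sketched in a few lines of Python. This is a minimal illustration, not the researchers' code: the function name `team_rewards` and its inputs are hypothetical, and the out-of-bounds penalty is left out because the article gives no magnitude for it.

```python
from typing import Dict, List


def team_rewards(hider_seen: List[bool], num_seekers: int) -> Dict[str, List[int]]:
    """Return per-agent rewards for one timestep (hypothetical helper).

    hider_seen[i] is True if hider i is currently visible to any seeker.
    """
    # Hiders share one reward: +1 only when every hider is hidden, else -1.
    hider_reward = -1 if any(hider_seen) else 1
    # Seekers always receive the opposite of the hiders' reward.
    seeker_reward = -hider_reward
    return {
        "hiders": [hider_reward] * len(hider_seen),
        "seekers": [seeker_reward] * num_seekers,
    }


# All hiders hidden: hiders are rewarded, seekers are punished.
print(team_rewards([False, False], num_seekers=2))
# One hider spotted: the whole hider team is punished.
print(team_rewards([False, True], num_seekers=2))
```

Because the reward is shared across a team, a single spotted hider costs every hider the same -1, which is what pushes teammates to cooperate.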
The agents can act on and observe the environment in the following ways:
- The agents can move by setting a force on themselves in the x and y directions as well as rotate around the z-axis.
- The agents can see objects in their line of sight and within a frontal cone.
- The agents can sense the distance to objects, walls, and other agents around them using a lidar-like sensor.
- The agents can grab and move objects in front of them.
- The agents can lock objects in place. Only the team that locked an object can unlock it.
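The locking rule in the last item can be illustrated with a small sketch. The class `LockableObject` and the team names are hypothetical, and the sketch assumes that an object already locked by one team cannot be re-locked by the other; the article only states that the locking team alone can unlock it.

```python
from typing import Optional


class LockableObject:
    """A movable object (box, ramp, wall) that a team can lock in place."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.locked_by: Optional[str] = None  # team name, or None if unlocked

    def lock(self, team: str) -> bool:
        """Lock the object for `team`; fails if it is already locked (assumption)."""
        if self.locked_by is not None:
            return False
        self.locked_by = team
        return True

    def unlock(self, team: str) -> bool:
        """Unlock the object; only the team that locked it may do so."""
        if self.locked_by != team:
            return False
        self.locked_by = None
        return True

    def can_move(self) -> bool:
        """A locked object stays in place and cannot be grabbed and moved."""
        return self.locked_by is None


ramp = LockableObject("ramp")
ramp.lock("hiders")
print(ramp.unlock("seekers"))  # seekers did not lock it, so this fails
print(ramp.can_move())         # the ramp is still locked in place
```

Tying the lock to a team is what makes locking a defensive tool: once hiders lock a ramp, seekers cannot free it.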
AI agents develop six distinct strategies
Agents are trained using self-play. Each agent's policy is composed of two separate networks with different parameters: a policy network that produces an action distribution and a critic network that predicts the discounted future returns. Policies are optimized using Proximal Policy Optimization (PPO) and Generalized Advantage Estimation (GAE), and training is performed using Rapid, OpenAI's general-purpose RL training system.
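To make the role of the critic concrete, here is a minimal sketch of Generalized Advantage Estimation in plain Python. It shows the standard GAE recursion, not the researchers' implementation; the function name `gae_advantages` and the default values of `gamma` and `lam` are assumptions chosen for illustration.

```python
from typing import List


def gae_advantages(
    rewards: List[float],
    values: List[float],
    gamma: float = 0.99,  # discount factor (illustrative default)
    lam: float = 0.95,    # GAE smoothing parameter (illustrative default)
) -> List[float]:
    """Compute GAE advantages for one episode.

    `values` must have one more entry than `rewards`: the critic's estimate
    for each visited state plus a bootstrap value for the final state
    (0.0 if the episode terminated).
    """
    if len(values) != len(rewards) + 1:
        raise ValueError("values needs len(rewards) + 1 entries")
    advantages = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        # One-step TD error: how much better the outcome was than predicted.
        delta = rewards[t] + gamma * values[t + 1] - values[t]
        # Exponentially weighted sum of this and all later TD errors.
        running = delta + gamma * lam * running
        advantages[t] = running
    return advantages


# With gamma = lam = 1 and a critic that predicts 0 everywhere,
# the advantage is simply the sum of remaining rewards.
print(gae_advantages([1.0, 1.0, 1.0], [0.0, 0.0, 0.0, 0.0], gamma=1.0, lam=1.0))
```

In a PPO-style update, these advantages weight the policy gradient, so actions that turned out better than the critic predicted become more likely.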
The researchers noticed that as agents train against each other in hide-and-seek, six distinct strategies emerge.
- Initially, hiders and seekers learn to crudely run away and chase.
- After approximately 25 million episodes of hide-and-seek, the hiders learn to use the tools at their disposal and intentionally modify their environment.
- After another 75 million episodes, the seekers also learn rudimentary tool use; they learn to move and use ramps to jump over obstacles, etc.
- 10 million episodes later, the hiders learn to defend against this strategy; the hiders learn to bring the ramps to the edge of the play area and lock them in place, seemingly removing the only tool the seekers have at their disposal.
- After 380 million total episodes of training, the seekers learn to bring a box to the edge of the play area where the hiders have locked the ramps. The seekers then jump on top of the box and surf it to the hiders’ shelter.
- In response, the hiders learn to lock all of the boxes in place before building their shelter.
The researchers also found some surprising behaviors from these AI agents.
- Box surfing: Since agents move by applying forces to themselves, they can grab a box while on top of it and “surf” it to the hider’s location.
- Endless running: Without adding explicit negative rewards for agents leaving the play area, in rare cases, hiders will learn to take a box and endlessly run with it.
- Ramp exploitation (hiders): Hiders abuse contact physics and remove ramps from the play area.
- Ramp exploitation (seekers): Seekers learn that if they run at a wall with a ramp at the right angle, they can launch themselves upward.
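The endless-running exploit is what the out-of-bounds penalty guards against. A minimal sketch of such a penalty follows, assuming a square play area centered at the origin; the function name `boundary_penalty`, the arena size, the margin, and the penalty value are all hypothetical, since the article does not give them.

```python
def boundary_penalty(
    x: float,
    y: float,
    half_size: float = 1.0,  # half the side length of the play area (assumed)
    margin: float = 0.5,     # how far outside counts as "too far" (assumed)
    penalty: float = -1.0,   # penalty magnitude (assumed)
) -> float:
    """Return `penalty` if (x, y) lies more than `margin` outside the arena."""
    limit = half_size + margin
    outside = abs(x) > limit or abs(y) > limit
    return penalty if outside else 0.0


print(boundary_penalty(0.0, 0.0))  # inside the play area: no penalty
print(boundary_penalty(1.2, 0.0))  # slightly outside but within the margin
print(boundary_penalty(2.0, 0.0))  # too far outside: penalized
```

Adding this term to each agent's reward removes the incentive for a hider to grab a box and run away from the arena indefinitely.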
The researchers concluded that complex human-relevant strategies and skills can emerge from multi-agent competition and standard reinforcement learning algorithms at scale. They state, “our results with hide-and-seek should be viewed as a proof of concept showing that multi-agent auto-curricula can lead to physically grounded and human-relevant behavior.”
This research was well received by readers, and many people took to Hacker News to congratulate the researchers. Here are a few comments.
“Amazing. Very cool to see this sort of multi-agent emergent behavior. Along with the videos, I can’t help but get a very ‘Portal’ vibe from it all. ‘Thank you for helping us help you help us all.’”
“This is incredible. The various emergent behaviors are fascinating. It seems that OpenAI has a great little game simulated for their agents to play in. The next step to make this even cooler would be to use physical, robotic agents learning to overcome challenges in real meatspace!”
“I’m completely amazed by that. The hint of a simulated world seems so matrix-like as well, imagine some intelligent thing evolving out of that. Wow.”