As AI-powered autonomous agents play an increasingly large role in society, we must ensure that their behavior aligns with societal values. To this end, we developed a novel technique for training an AI agent to operate optimally in a given environment while following implicit constraints on its behavior. Our strategy incorporates a bottom-up (or demonstration-based) approach to learning, creating a ‘show and tell’ session for AI. We successfully used the strategy to train an AI system to play a video game similar to the classic Pac-Man™, scoring as many points as possible without harming any characters.
Our system combines three main components, as illustrated in Figure 1:
Inverse reinforcement learning shows the agent how we want it to operate;
Reinforcement learning allows the agent to learn how to maximize its score; and
A contextual bandit-based orchestrator enables the agent to combine these two policies in complex ways while indicating which objective is driving behavior at each step.
Figure 1. Overview of our system. We use inverse reinforcement learning (IRL) on demonstrations of desired behavior to learn a reward function that captures the behavioral constraints being demonstrated and obtain a constraint-obeying policy (green box). We apply reinforcement learning (RL) to the original environment rewards to learn a reward-maximizing policy (red box). The two policies are then brought into the orchestrator, which selects between them at each time step (blue box), based on observations from the environment.
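The three components fit together in a simple control loop: at each time step the orchestrator observes the environment and decides which of the two pre-trained policies supplies the next action. The sketch below is a hypothetical stand-in for that loop, not our implementation: the environment, both policies, and the orchestrator's selection rule are all trivial stubs (in the real system, the selection rule is the learned contextual bandit).

```python
# Hedged sketch of the Figure 1 control loop. Every class and function here
# is a hypothetical stub standing in for the real learned components.

class StubEnv:
    """Toy episodic environment: ends after a fixed number of steps."""
    def __init__(self, length=5):
        self.length, self.t = length, 0
    def reset(self):
        self.t = 0
        return 0
    def step(self, action):
        self.t += 1
        return self.t, 1.0, self.t >= self.length

def constraint_policy(obs):
    return 0          # stand-in for the IRL-derived, constraint-obeying policy

def reward_policy(obs):
    return 1          # stand-in for the RL-derived, reward-maximizing policy

def orchestrator_select(obs):
    # Stand-in for the contextual bandit: a trivial fixed rule on the context
    return 0 if obs % 2 == 0 else 1

env = StubEnv()
obs, done, trace = env.reset(), False, []
while not done:
    which = orchestrator_select(obs)          # which objective acts this step
    policy = constraint_policy if which == 0 else reward_policy
    obs, reward, done = env.step(policy(obs))
    trace.append("constraint" if which == 0 else "reward")

print(trace)  # per-step readout of which objective drove behavior
```

Logging `trace` alongside the actions is what makes each decision attributable to one objective, which is the interpretability property described above.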
Showing the agent how to behave
Behavior of autonomous agents is often governed by a complex set of rules. But writing rules to cover every potential situation or circumstance is nearly impossible. In many situations, demonstrating the desired behaviors could be a faster and more effective way of training agents to obey behavioral constraints. These constraints can come from any number of sources, including regulations, business process guidelines, laws, ethical principles, social norms, and moral values. In our video game domain, we wanted to teach an agent to play a game similar to Pac-Man™ without harming other characters, even though that behavior is rewarded with points in unconstrained game play. We used inverse reinforcement learning for this purpose: humans played the game without harming other characters, and the system observed these demonstrations and learned the constraint—to spare the other characters—even though it was never explicitly instructed to do so.
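One way to picture what IRL does here is feature matching: learn reward weights so that a policy trained on the learned reward visits states with the same frequency the demonstrators did. The toy sketch below is an illustration of that idea, not the algorithm used in our system; the five-state problem, indicator features, and demonstration frequencies are all invented, and a softmax over learned rewards stands in for solving the inner planning problem.

```python
import numpy as np

# Minimal feature-matching IRL sketch (a toy stand-in, not the paper's exact
# algorithm): learn reward weights w so that the visit distribution of a
# softmax policy over the learned rewards matches the demonstrations.

n_states = 5
features = np.eye(n_states)            # indicator feature phi(s) per state

# Empirical state-visit frequencies from hypothetical human demos;
# the demonstrators never visit state 4 -- that is the implicit constraint.
demo_visits = np.array([0.4, 0.3, 0.2, 0.1, 0.0])
mu_demo = demo_visits @ features

w = np.zeros(n_states)
for _ in range(200):
    reward = features @ w
    visits = np.exp(reward - reward.max())   # softmax policy as a stand-in
    visits /= visits.sum()                   # for solving the inner MDP
    mu_policy = visits @ features
    w += 0.1 * (mu_demo - mu_policy)         # feature-matching gradient step

learned_reward = features @ w
# The never-demonstrated state ends up with the lowest learned reward:
print(int(np.argmin(learned_reward)))  # -> 4
```

The key point mirrors the paragraph above: nobody wrote a rule saying "avoid state 4"; the learned reward simply assigns it low value because the demonstrations never go there.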
Telling the agent how to win
In addition to obeying behavioral constraints, agents still need to achieve the task at hand. In the video game domain, this means scoring as many points as possible. Although human demonstrations as described above could be used for this as well, we used reinforcement learning for several reasons. First, the human demonstrations were intended to depict constrained behavior and may not have been optimal in terms of maximizing environment rewards. Furthermore, since the demonstrations were given by humans, they are prone to error. In addition, while learning how to play on its own, an agent may try things that we never thought of, and we want to capture that creativity. Finally, reinforcement learning allows an agent to learn how to act in aspects of the environment or task that were not or cannot be demonstrated.
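To make the reinforcement learning side concrete, here is a tabular Q-learning sketch on a hypothetical one-dimensional corridor where the agent earns +1 only on reaching the rightmost cell. The environment, hyperparameters, and sweep-style updates are all illustrative assumptions, not our actual game setup; because the toy dynamics are deterministic, the Q-learning update is applied as a sweep over every state-action pair rather than over sampled episodes.

```python
import numpy as np

# Tabular Q-learning sketch on a hypothetical 1-D corridor (toy stand-in
# for the game environment; all values here are illustrative).

n_states, n_actions = 5, 2            # actions: 0 = left, 1 = right
gamma, alpha = 0.9, 0.5
Q = np.zeros((n_states, n_actions))

def step(state, action):
    """Deterministic toy dynamics: move one cell left or right."""
    nxt = max(0, state - 1) if action == 0 else min(n_states - 1, state + 1)
    reward = 1.0 if nxt == n_states - 1 else 0.0
    done = nxt == n_states - 1
    return nxt, reward, done

# Q-learning update swept over all state-action pairs; the deterministic
# toy dynamics make this equivalent to value iteration on Q.
for _ in range(100):
    for s in range(n_states - 1):          # state 4 is terminal
        for a in range(n_actions):
            s2, r, done = step(s, a)
            target = r + gamma * np.max(Q[s2]) * (not done)
            Q[s, a] += alpha * (target - Q[s, a])

policy = [int(np.argmax(Q[s])) for s in range(n_states - 1)]
print(policy)  # -> [1, 1, 1, 1]: always move toward the reward
```

No demonstrations are involved here: the agent discovers the reward-maximizing policy purely from the environment's reward signal, which is exactly why this component can exceed or deviate from what humans showed.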
Integrating the two policies in an interpretable manner
Once the agent has learned policies to achieve the two objectives of obeying constraints and maximizing reward, we use a contextual bandit-based orchestrator to choose which policy to follow for each action based on the context of the game. The orchestrator allows the agent to mix policies in novel ways, taking the best actions from either a reward-maximizing or constraint-obeying policy and even learning creative strategies that humans may not think of. The orchestrator also provides a readout of which policy is being followed for each action, making the system transparent and interpretable, important properties for autonomous agents. Knowing how a system arrives at an outcome is a key pillar of trusted AI.
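A minimal sketch of the orchestration idea, under invented assumptions: the context is a single discretized feature (whether another character is nearby), the two bandit "arms" are the constraint-obeying policy (0) and the reward-maximizing policy (1), and the synthetic reward log heavily penalizes arm choices that lead to harming a character. Our actual orchestrator and its context features differ; this only illustrates how per-context reward estimates let the agent switch policies.

```python
import numpy as np

# Toy contextual-bandit orchestrator (hypothetical setup): estimate the
# average reward of each policy "arm" in each context, then act greedily.

n_contexts, n_arms = 2, 2             # context 0: character nearby, 1: far
counts = np.zeros((n_contexts, n_arms))
means = np.zeros((n_contexts, n_arms))

def update(context, arm, reward):
    """Incremental running average of observed reward per (context, arm)."""
    counts[context, arm] += 1
    means[context, arm] += (reward - means[context, arm]) / counts[context, arm]

# Invented logged interactions: when a character is nearby (context 0),
# the reward-maximizing arm leads to a harmed character, which the
# orchestrator's reward signal penalizes heavily.
logged = [
    (0, 0, +1.0), (0, 1, -10.0),      # nearby: obey the constraint
    (1, 0, +1.0), (1, 1, +5.0),       # far away: maximize points freely
] * 25
for context, arm, reward in logged:
    update(context, arm, reward)

# The greedy choice per context doubles as the interpretability readout:
# it names which objective drives behavior in each situation.
chosen = [int(np.argmax(means[c])) for c in range(n_contexts)]
print(chosen)  # -> [0, 1]
```

Reading off `chosen` is the transparency property described above: for any observed context, the system can report whether the constraint-obeying or the reward-maximizing objective is in control.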
This work builds on prior research that incorporated behavioral constraints into a movie recommendation system, presented in a demo earlier this year at the International Joint Conference on Artificial Intelligence (IJCAI), by extending the approach to a more complex domain. In the movie recommendation case, decisions are atomic: actions taken by the agent do not interact with subsequent actions. But a video game is more complicated: actions occur in sequence, and choices have bearing on downstream decisions. Future work could apply this technique to even more complex domains, such as more realistic video games, and eventually to real-world scenarios.
Members of the research team. Top row, from left: Ritesh Noothigattu, Djallel Bouneffouf, Nicholas Mattei, Rachita Chandra, Piyush Madan. Second row, from left: Kush Varshney, Murray Campbell, Moninder Singh, Francesca Rossi.