As AI-powered autonomous agents play an increasingly large role in society, we must ensure that their behavior aligns with societal values. To this end, we developed a novel technique for training an AI agent to operate optimally in a given environment while following implicit constraints on its behavior. Our strategy incorporates a bottom-up (or demonstration-based) approach to learning, creating a ‘show and tell’ session for AI. We successfully used the strategy to train an AI system to play a video game similar to the classic Pac-Man™, scoring as many points as possible without harming any characters.
The approach and results are detailed in our paper, Interpretable Multi-Objective Reinforcement Learning through Policy Orchestration. The project is part of IBM’s Science for Social Good initiative, which focuses science and technology on tackling societal challenges.
Our system combines three main components, as illustrated in Figure 1:
- Inverse reinforcement learning shows the agent how we want it to operate;
- Reinforcement learning allows the agent to learn how to maximize its score; and
- A contextual bandit-based orchestrator enables the agent to combine these two policies in complex ways while indicating which objective is driving behavior at each step.
Showing the agent how to behave
The behavior of autonomous agents is often governed by a complex set of rules. But writing rules to cover every potential situation or circumstance is nearly impossible. In many situations, demonstrating the desired behaviors could be a faster and more effective way of training agents to obey behavioral constraints. These constraints can come from any number of sources, including regulations, business process guidelines, laws, ethical principles, social norms, and moral values. In our video game domain, we wanted to teach an agent to play a game similar to Pac-Man™ without harming other characters, even though that behavior is rewarded with points in unconstrained game play. We used inverse reinforcement learning for this purpose: humans played the game without harming other characters, and the system observed these demonstrations and learned the constraint (to spare the other characters) even though it was never explicitly instructed to do so.
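To make the idea of learning from demonstrations more concrete, here is a minimal sketch of one ingredient used by some inverse reinforcement learning methods: comparing the discounted feature expectations of demonstrated play with those of unconstrained play. This post does not say which inverse reinforcement learning algorithm the system uses, so this is an illustration under assumptions, not the method from the paper. The two features (`ate_pellet`, `harmed_character`), the toy trajectories, and the function names are all hypothetical.

```python
def feature_expectations(trajectories, gamma=0.9):
    """Average discounted sum of per-step feature vectors over trajectories."""
    n_features = len(trajectories[0][0])
    totals = [0.0] * n_features
    for trajectory in trajectories:
        for t, features in enumerate(trajectory):
            for i, value in enumerate(features):
                totals[i] += (gamma ** t) * value
    return [total / len(trajectories) for total in totals]


def reward_weights(demo_trajectories, baseline_trajectories, gamma=0.9):
    """Weight each feature by how much more the demonstrations exhibit it."""
    mu_demo = feature_expectations(demo_trajectories, gamma)
    mu_base = feature_expectations(baseline_trajectories, gamma)
    return [d - b for d, b in zip(mu_demo, mu_base)]


# Hypothetical per-step features: [ate_pellet, harmed_character].
# Demonstrators collect pellets but never harm a character.
demos = [[[1, 0], [1, 0], [0, 0]], [[1, 0], [0, 0], [1, 0]]]
# Unconstrained play (assumed for illustration) does both.
baseline = [[[1, 0], [0, 1], [1, 0]], [[0, 1], [1, 0], [1, 0]]]

weights = reward_weights(demos, baseline)
print(weights)
```

In this toy data the weight on `harmed_character` comes out negative, so a policy trained against these weights would be discouraged from harming characters even though no explicit rule was written.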
Telling the agent how to win
In addition to obeying behavioral constraints, agents still need to achieve the task at hand. In the video game domain, this means scoring as many points as possible. Although human demonstrations as described above could be used for this as well, we used reinforcement learning for several reasons. First, the human demonstrations were intended to depict constrained behavior and may not have been optimal in terms of maximizing environment rewards. Furthermore, since the demonstrations were given by humans, they are prone to error. In addition, while learning how to play on its own, an agent may try things that we never thought of, and we want to capture that creativity. Finally, reinforcement learning allows an agent to learn how to act in aspects of the environment or task that were not, or cannot be, demonstrated.
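As a rough illustration of how an agent can learn to maximize reward by trial and error, the sketch below runs tabular Q-learning on a toy corridor. The post does not name the reinforcement learning algorithm used for the game, so Q-learning is swapped in here as a familiar stand-in. The corridor environment, the function name `q_learning`, and every hyperparameter are assumptions made for the example.

```python
import random


def q_learning(n_states=5, episodes=300, alpha=0.5, gamma=0.9,
               epsilon=0.2, seed=0):
    """Tabular Q-learning on a toy corridor.

    The agent starts in cell 0 and receives a reward of 1 when it
    reaches the last cell, which ends the episode.
    """
    rng = random.Random(seed)
    moves = (-1, +1)  # action 0 = step left, action 1 = step right
    q = [[0.0, 0.0] for _ in range(n_states)]
    goal = n_states - 1
    for _ in range(episodes):
        state = 0
        while state != goal:
            if rng.random() < epsilon:
                action = rng.randrange(2)  # explore
            else:
                best = max(q[state])       # exploit, breaking ties randomly
                action = rng.choice(
                    [a for a in range(2) if q[state][a] == best])
            next_state = min(max(state + moves[action], 0), goal)
            reward = 1.0 if next_state == goal else 0.0
            # Move the estimate toward reward plus discounted future value.
            target = reward + gamma * max(q[next_state])
            q[state][action] += alpha * (target - q[state][action])
            state = next_state
    return q


q_table = q_learning()
greedy_policy = ["right" if right > left else "left"
                 for left, right in q_table[:-1]]
print(greedy_policy)
```

The agent is never told that stepping right is correct; it discovers this from the reward signal alone, which is the property the paragraph above relies on.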
Integrating the two policies in an interpretable manner
Once the agent has learned policies to achieve the two objectives of obeying constraints and maximizing reward, we use a contextual bandit-based orchestrator to choose which policy to follow for each action based on the context of the game. The orchestrator allows the agent to mix policies in novel ways, taking the best actions from either a reward-maximizing or constraint-obeying policy and even learning creative strategies that humans may not think of. The orchestrator also provides a readout of which policy is being followed for each action, making the system transparent and interpretable, important properties for autonomous agents. Knowing how a system arrives at an outcome is a key pillar of trusted AI.
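The orchestration step can be sketched as a two-armed contextual bandit whose arms are the two learned policies. The example below uses an epsilon-greedy bandit with one linear reward estimate per policy. This is a deliberately simple stand-in, and the bandit algorithm in the paper may differ. The class name `Orchestrator`, the two-element context vector, the policy names, and the reward values are hypothetical.

```python
import random


class Orchestrator:
    """Epsilon-greedy contextual bandit over pre-trained policies (sketch)."""

    def __init__(self, policies, n_features, epsilon=0.1, lr=0.05, seed=0):
        self.policies = policies      # name -> callable(state) -> action
        self.names = list(policies)
        self.epsilon = epsilon
        self.lr = lr
        self.rng = random.Random(seed)
        # One linear reward estimate per arm (policy).
        self.weights = {name: [0.0] * n_features for name in self.names}

    def _estimate(self, name, context):
        return sum(w * x for w, x in zip(self.weights[name], context))

    def choose(self, context):
        """Pick a policy name: explore at random or take the best estimate."""
        if self.rng.random() < self.epsilon:
            return self.rng.choice(self.names)
        return max(self.names, key=lambda n: self._estimate(n, context))

    def act(self, state, context):
        """Return (policy name, action) so the choice can be read out."""
        name = self.choose(context)
        return name, self.policies[name](state)

    def update(self, name, context, reward):
        """Nudge the chosen arm's estimate toward the observed reward."""
        error = reward - self._estimate(name, context)
        self.weights[name] = [w + self.lr * error * x
                              for w, x in zip(self.weights[name], context)]


# Hypothetical context: [character_nearby, pellet_nearby].
orchestrator = Orchestrator(
    {"constrained": lambda state: "avoid",
     "reward_max": lambda state: "chase"},
    n_features=2, epsilon=0.0)
for _ in range(50):
    orchestrator.update("constrained", [1.0, 0.0], 1.0)
    orchestrator.update("reward_max", [0.0, 1.0], 1.0)

policy_name, action = orchestrator.act("some_state", [1.0, 0.0])
print(policy_name, action)
```

Because `act` returns the policy name alongside the action, every step carries a record of which objective drove the decision, which is the interpretability readout described above.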
The work builds on prior research on incorporating behavioral constraints into movie recommendation systems, presented in a demo earlier this year at the International Joint Conference on Artificial Intelligence (IJCAI), by extending it to a more complex domain. In the movie recommendation case, decisions are atomic: actions taken by the agent do not interact with subsequent actions. But a video game is more complicated: actions occur in sequence, and choices have bearing on downstream decisions. Future work could apply this technique to even more complex domains, such as more realistic video games, and eventually to real-world scenarios.
1. Interpretable Multi-Objective Reinforcement Learning through Policy Orchestration. Ritesh Noothigattu, Djallel Bouneffouf, Nicholas Mattei, Rachita Chandra, Piyush Madan, Kush Varshney, Murray Campbell, Moninder Singh, Francesca Rossi.
2. Using Contextual Bandits with Behavioral Constraints for Constrained Online Movie Recommendation. Avinash Balakrishnan, Djallel Bouneffouf, Nicholas Mattei, Francesca Rossi. IJCAI 2018.