Science for Social Good

‘Show and Tell’ Helps AI Agent Align with Societal Values


As AI-powered autonomous agents play an increasingly large role in society, we must ensure that their behavior aligns with societal values. To this end, we developed a novel technique for training an AI agent to operate optimally in a given environment while following implicit constraints on its behavior. Our strategy incorporates a bottom-up (or demonstration-based) approach to learning, creating a ‘show and tell’ session for AI. We successfully used the strategy to train an AI system to play a video game similar to the classic Pac-Man™, scoring as many points as possible without harming any characters.

The approach and results are detailed in our paper, Interpretable Multi-Objective Reinforcement Learning through Policy Orchestration [1]. The project is part of IBM’s Science for Social Good initiative, which focuses science and technology on tackling societal challenges.

Our system combines three main components, as illustrated in Figure 1:

  1. Inverse reinforcement learning shows the agent how we want it to operate;
  2. Reinforcement learning allows the agent to learn how to maximize its score; and
  3. A contextual bandit-based orchestrator enables the agent to combine these two policies in complex ways while indicating which objective is driving behavior at each step.

Figure 1. Overview of our system. We use inverse reinforcement learning (IRL) on demonstrations of desired behavior to learn a reward function that captures the behavioral constraints being demonstrated and obtain a constraint-obeying policy (green box). We apply reinforcement learning (RL) to the original environment rewards to learn a reward-maximizing policy (red box). The two policies are then brought into the orchestrator, which selects between them at each time step (blue box), based on observations from the environment.

Showing the agent how to behave

The behavior of autonomous agents is often governed by a complex set of rules, but writing rules to cover every potential situation or circumstance is nearly impossible. In many situations, demonstrating the desired behaviors could be a faster and more effective way of training agents to obey behavioral constraints. These constraints can come from any number of sources, including regulations, business process guidelines, laws, ethical principles, social norms, and moral values. In our video game domain, we wanted to teach an agent to play a game similar to Pac-Man™ without harming other characters, even though harming them is rewarded with points in unconstrained game play. We used inverse reinforcement learning for this purpose: humans played the game without harming other characters, and the system observed these demonstrations and learned the constraint (to spare the other characters) even though it was never explicitly instructed to do so.
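To make the idea of learning a constraint from demonstrations concrete, here is a minimal sketch. It is not the method used in the paper: it only compares discounted feature expectations between demonstrations and unconstrained play, which is the first step of feature-matching approaches to inverse reinforcement learning. The features (pellet_eaten, ghost_eaten), the trajectories, and all function names are invented for illustration.

```python
def feature_expectations(trajectories, gamma=0.9):
    """Average discounted feature counts over a set of trajectories."""
    n_features = len(trajectories[0][0])
    totals = [0.0] * n_features
    for trajectory in trajectories:
        for t, features in enumerate(trajectory):
            for i, value in enumerate(features):
                totals[i] += (gamma ** t) * value
    return [total / len(trajectories) for total in totals]


def reward_weights(demos, baseline, gamma=0.9):
    """Linear reward weights: demonstration expectations minus baseline."""
    mu_demo = feature_expectations(demos, gamma)
    mu_base = feature_expectations(baseline, gamma)
    return [d - b for d, b in zip(mu_demo, mu_base)]


# Hypothetical per-step features: (pellet_eaten, ghost_eaten).
human_demos = [
    [(1, 0), (1, 0), (0, 0), (1, 0)],
    [(1, 0), (0, 0), (1, 0), (1, 0)],
]
unconstrained_play = [
    [(1, 0), (0, 1), (1, 0), (0, 1)],
    [(0, 1), (1, 0), (1, 0), (0, 1)],
]

weights = reward_weights(human_demos, unconstrained_play)
print(dict(zip(("pellet_eaten", "ghost_eaten"), weights)))
```

Because the demonstrators never eat a ghost, the weight on ghost_eaten comes out negative even though no rule against it was ever written down. A full inverse reinforcement learning procedure would go on to refine these weights iteratively.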

Telling the agent how to win

In addition to obeying behavioral constraints, agents still need to achieve the task at hand. In the video game domain, this means scoring as many points as possible. Although human demonstrations as described above could be used for this as well, we used reinforcement learning for several reasons. First, the human demonstrations were intended to depict constrained behavior and may not have been optimal in terms of maximizing environment rewards. Furthermore, since the demonstrations were given by humans, they are prone to error. In addition, while learning how to play on its own, an agent may try things that we never thought of, and we want to capture that creativity. Finally, reinforcement learning allows an agent to learn how to act in aspects of the environment or task that were not or cannot be demonstrated.
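As an illustration of learning to maximize reward by trial and error, the sketch below runs tabular Q-learning on a toy five-cell corridor. The post does not say which reinforcement learning algorithm was used, so Q-learning is an assumption chosen for brevity. The environment, constants, and function names are hypothetical.

```python
import random

N_STATES = 5        # cells 0..4; reaching cell 4 ends the episode
ACTIONS = (-1, +1)  # move left or move right


def step(state, action):
    """Toy environment: reward 1.0 only when the goal cell is reached."""
    next_state = min(max(state + action, 0), N_STATES - 1)
    done = next_state == N_STATES - 1
    return next_state, (1.0 if done else 0.0), done


def q_learning(episodes=500, alpha=0.5, gamma=0.9, epsilon=0.2, seed=0):
    rng = random.Random(seed)
    q = {(s, a): 0.0 for s in range(N_STATES) for a in ACTIONS}
    for _ in range(episodes):
        state, done = 0, False
        while not done:
            if rng.random() < epsilon:
                action = rng.choice(ACTIONS)  # explore
            else:
                # Exploit, breaking ties at random.
                best = max(q[(state, a)] for a in ACTIONS)
                action = rng.choice(
                    [a for a in ACTIONS if q[(state, a)] == best]
                )
            next_state, reward, done = step(state, action)
            future = 0.0 if done else max(q[(next_state, a)] for a in ACTIONS)
            target = reward + gamma * future
            q[(state, action)] += alpha * (target - q[(state, action)])
            state = next_state
    return q


def greedy_policy(q):
    """Best action in every non-terminal state."""
    return {s: max(ACTIONS, key=lambda a: q[(s, a)]) for s in range(N_STATES - 1)}


print(greedy_policy(q_learning()))
```

The agent is never told that moving right is correct. It discovers this from the reward signal alone, which is the property that lets a reinforcement learner find behavior nobody demonstrated.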

Integrating the two policies in an interpretable manner

Once the agent has learned policies to achieve the two objectives of obeying constraints and maximizing reward, we use a contextual bandit-based orchestrator to choose which policy to follow for each action, based on the context of the game. The orchestrator allows the agent to mix policies in novel ways, taking the best actions from either the reward-maximizing or the constraint-obeying policy and even learning creative strategies that humans may not think of. The orchestrator also provides a readout of which policy is being followed for each action, which makes the system transparent and interpretable, two important properties for autonomous agents. Knowing how a system arrives at an outcome is a key pillar of trusted AI.
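The orchestration step can be sketched as a two-armed contextual bandit in which each arm is one of the learned policies. The version below is a minimal illustration: it uses an epsilon-greedy linear bandit for simplicity, which is an assumption, since the post does not specify the bandit algorithm. The context features, the toy_reward function, and all names are hypothetical.

```python
import random

ARMS = ("constraint_obeying", "reward_maximizing")


class Orchestrator:
    """Epsilon-greedy linear contextual bandit over two fixed policies."""

    def __init__(self, n_features, epsilon=0.1, learning_rate=0.05, seed=0):
        self.epsilon = epsilon
        self.learning_rate = learning_rate
        self.rng = random.Random(seed)
        # One linear reward estimate per arm (one arm per policy).
        self.weights = {arm: [0.0] * n_features for arm in ARMS}

    def estimate(self, arm, context):
        return sum(w * x for w, x in zip(self.weights[arm], context))

    def choose(self, context, explore=True):
        """Return the name of the policy to follow; the name is the readout."""
        if explore and self.rng.random() < self.epsilon:
            return self.rng.choice(ARMS)
        return max(ARMS, key=lambda arm: self.estimate(arm, context))

    def update(self, arm, context, reward):
        """One stochastic-gradient step on the squared prediction error."""
        error = reward - self.estimate(arm, context)
        self.weights[arm] = [
            w + self.learning_rate * error * x
            for w, x in zip(self.weights[arm], context)
        ]


def toy_reward(arm, ghost_near):
    """Hypothetical combined reward, invented for this example only."""
    if ghost_near:
        return 1.0 if arm == "constraint_obeying" else 0.0
    return 0.2 if arm == "constraint_obeying" else 1.0


def train_toy_orchestrator(steps=5000, seed=0):
    env_rng = random.Random(seed + 1)
    orchestrator = Orchestrator(n_features=2, seed=seed)
    for _ in range(steps):
        ghost_near = env_rng.random() < 0.5
        context = [1.0, 1.0 if ghost_near else 0.0]  # [bias, ghost_near]
        arm = orchestrator.choose(context)
        orchestrator.update(arm, context, toy_reward(arm, ghost_near))
    return orchestrator


orchestrator = train_toy_orchestrator()
for ghost_near in (0.0, 1.0):
    print(ghost_near, orchestrator.choose([1.0, ghost_near], explore=False))
```

Because choose returns the name of the selected policy rather than a raw action, every step carries a record of which objective drove the decision. That record is the kind of readout described above.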

Increasing complexity

This work builds on prior research on incorporating behavioral constraints into movie recommendation systems, presented in a demo earlier this year at the International Joint Conference on Artificial Intelligence (IJCAI) [2], and extends it to a more complex domain. In the movie recommendation case, decisions are atomic: actions taken by the agent do not interact with subsequent actions. A video game is more complicated: actions occur in sequence, and each choice has a bearing on downstream decisions. Future work could apply this technique to even more complex domains, such as more realistic video games, and eventually to real-world scenarios.


1. Interpretable Multi-Objective Reinforcement Learning through Policy Orchestration. Ritesh Noothigattu, Djallel Bouneffouf, Nicholas Mattei, Rachita Chandra, Piyush Madan, Kush Varshney, Murray Campbell, Moninder Singh, Francesca Rossi.

2. Using Contextual Bandits with Behavioral Constraints for Constrained Online Movie Recommendation. Avinash Balakrishnan, Djallel Bouneffouf, Nicholas Mattei, Francesca Rossi. IJCAI 2018.


Members of the research team. Top row, from left: Ritesh Noothigattu, Djallel Bouneffouf, Nicholas Mattei, Rachita Chandra, Piyush Madan. Second row, from left: Kush Varshney, Murray Campbell, Moninder Singh, Francesca Rossi.
