#### AI

# Addressing open fundamental questions in reinforcement learning

December 11, 2019 | Written by: Mark Squillante, Yingdong Lu, and Chai Wah Wu

Categorized: AI

Share this post:

Despite its rich history addressing a wide variety of decision-making problems, reinforcement learning can suffer from errors in approximation and estimation that cause the choice of suboptimal actions. In seeking to address these problems, recent work [2] has raised three open fundamental questions at the heart of reinforcement learning.

New research by our team at IBM Research [3], which will be presented at NeurIPS 2019, introduces an innovative probabilistic framework for reinforcement learning that sheds new light on and helps to resolve these open fundamental questions. We present new theoretical results that address these open fundamental questions. We also present new empirical results that demonstrate improved performance of using a new stochastic approach.

**Classical Reinforcement Learning**

A standard reinforcement learning (RL) framework consists of a learning agent interacting with a stochastic environment, which is modeled as a discrete-space, discrete-time discounted Markov decision process (MDP). Through iterative application of the classical Bellman operator, value iteration is used to determine the optimal value function and the optimal action for each state of the MDP that maximizes the expected discounted cumulative reward.

More formally, a stationary policy \( \pi\) of the MDP induces a value function \(V\) and an action-value function \( Q\) where \( V(x):=Q(x,\pi(x))\) defines the expected discounted cumulative reward under policy \( \pi\) starting in state \(x\) of the MDP and where \(Q\) satisfies the so-called Bellman equation

\[ Q(x,a)=R(x,a)+\gamma \mathbb{E}_P Q(x’,\pi(x’)).\]

Here, \(R\) denotes the reward function over state-action pairs, \( \gamma\) denotes the discount factor, \( \mathbb{E}_P\) denotes expectation with respect to the transition probability kernel \( P\) of the MDP, and \( x’\) denotes the next state random variable. The goal is to determine a policy \(\pi^*\) that achieves the optimal value function \( V^*(x)\), which also produces the optimal action-value function \(Q^*(x,a)\).

The classical Bellman operator \( \mathcal{T}_B \) is defined pointwise as

\[ \mathcal{T}_B Q(x,a) := R(x,a) + \gamma \mathbb{E}_P \max_b Q(x’,b).\]

It is well known that, under the Bellman operator with \( Q_{k+1}:= \mathcal{T}_B Q_k\), value iteration is guaranteed to yield \( Q^*(x,a)\) in the limit as \( k \rightarrow \infty\) for all states \( x\) and actions \(a\) of the MDP. This, in turn, can then be used to obtain the optimal policy \( \pi^*\).

**Alternative Operators and Open Fundamental Questions**

When there are no approximation or estimation errors, then value iteration with the Bellman operator can be used to solve the corresponding MDP exactly and yield the optimal policy. On the other hand, the presence of approximation or estimation errors can cause suboptimal actions to be chosen instead of an optimal action under the Bellman operator, and thus in turn potentially causing errors in identifying truly optimal actions. One typically encountered example environment of such approximation/estimation errors can occur when a discrete-state, discrete-time MDP is used to approximate a continuous-state, continuous-time system.

Building on the work in [1] with respect to action-gap regularity and its performance loss benefits, researchers have recently sought to address this problem by considering (purely deterministic) value-iteration operators as alternatives to the classical Bellman operator [2]. In particular, they proposed the so-called consistent Bellman operator \( \mathcal{T}_C \). They then proceeded to show that the consistent Bellman operator yields the optimal policy \( \pi^*\), and specifically that \( \mathcal{T}_C \) is both optimality-preserving and gap-increasing according to provided (deterministic) definitions compatible with those in [1]. The former property ensures optimality whereas the latter property may assist the value-iteration algorithm to determine the optimal actions of an agent faster, more easily, and with less errors of mislabeling suboptimal actions.

Lastly, the work in [2] raised three open fundamental questions at the heart of RL in the presence of approximation and estimation errors. These open questions concern the possibility of weaker conditions for optimality, the statistical efficiency of their proposed deterministic operators, and the possibility of finding a maximally efficient operator.

**Robust Stochastic Operators and Open Fundamental Questions**

In our new paper [3], we introduce a new approach based on a novel stochastic framework that can increase the action gap well beyond the limited region for individual value iterations under any purely deterministic operator – via a random variable – while controlling, in a probabilistic manner, the overall value iterations to preserve optimality – via a sequence of random variables. To the best of our knowledge, this is the first proposal and theoretical analysis of such types of robust stochastic operators, which is an approach not often seen in the RL literature and should be exploited to a much greater extent.

More precisely, for all sequences of independent nonnegative random variables \( \{ \beta_k : k \in \mathbb{Z}_+ \}\) having expectation \( \overline{\beta}_k := \mathbb{E}[\beta_k]\) between \( 0\) and \( 1\) inclusively, we define

\[ \mathcal{T}_{\beta_k} Q(x,a) := R(x,a) + \gamma \mathbb{E}_P \max_b Q(x’,b) – \beta_k (V_k(x) – Q_k(x,a)).\]

Then the general family of robust stochastic operators comprise all sequences of operators \( {\mathcal{T}_{\beta_k}}\) with each defined as above over all probability distributions for each random variable in the sequences \( {\beta_k}\) with \(\overline{\beta}_k \in [0,1]\). This general family of stochastic operators supports having the random variables \( \beta_k\) follow a different probability distribution for each \( k\).

Upon defining optimality-preserving and gap-increasing in a stochastic sense compatible with the deterministic definitions in [2], we establish that our general family of robust stochastic operators is both optimality-preserving and gap-increasing. Since the value-iteration sequence generated under our stochastic operators is based on realizations of independent nonnegative random variables, our family of stochastic operators subsumes the family of purely deterministic operators in [2] as a strict subset (because the realizations of random variables can be fixed to match that of any deterministic operators as a special case). We further establish that stochastic and variability ordering properties among the sequence of random operators lead to corresponding ordering properties among the action gaps of the random operators. These theoretical results offer key relational insights into important orderings of different operators in our family of stochastic operators.

Our introduction of a novel probabilistic framework based on a new family of stochastic operators and our shifting the focus from one deterministic operator to a sequence of stochastic operators sheds new light on and helps to resolve the open fundamental questions raised in [2]. Specifically, our theoretical results show that the conditions for optimality are much weaker and the statistical efficiency of our operators can be made much stronger, both allowing significant degrees of freedom in finding alternatives to the Bellman operator for different purposes and applications. Meanwhile, these important improvements completely alter and clarify the question of finding the maximally efficient operators from a finite dimensional parameter optimization problem suggested in [2] to an optimization problem in an infinite dimensional space (of the infinite sequences of random variables). Yet another important implication of our theoretical results is that the order relationships among the sequences of random operators with respect to action gaps, corresponding to our stochastic and variability ordering results, may potentially lead to determining the best sequence of random variables and possibly even lead to maximally efficient operators.

Empirical results further demonstrate the benefits of the new family of stochastic operators, significantly outperforming the classical Bellman operator and the recently proposed consistent Bellman operator. Experiments were conducted with several existing OpenAI Gym codes, namely Acrobot, Mountain Car, Cart Pole and Lunar Lander, each modified solely to replace the Bellman operator with the consistent Bellman operator and with various instances of our family of robust stochastic operators. As a representative example, the following figure illustrates our empirical results for Acrobot with a goal of minimizing the score. This figure plots the Acrobot problem score, averaged over moving windows of 1000 episodes across 50 experimental trials, as a function of the number of episodes for the set of operators during the training phase. We observe that the average scores under our stochastic operators generally exhibit much better performance than under the Bellman operator or the consistent Bellman operator. The corresponding results for the testing phase exhibited similar performance trends with the best average scores obtained under our stochastic operators.

**Discussion**

We introduce an innovative probabilistic framework for RL which includes a new family of stochastic operators that subsumes as a strict subset the classical Bellman operator and other purely deterministic operators proposed in the literature and that is proven to be optimality-preserving and gap-increasing. Our stochastic framework and theoretical results shed new light on and help to resolve recently raised open fundamental questions in RL, specifically showing that the conditions for optimality can be made much weaker, that the statistical efficiency of operators can be made much stronger, and that searching for maximally efficient operators is an optimization problem in an infinite dimensional space for which we establish stochastic ordering properties. We view the problem of finding maximally efficient operators for RL problems as identifying sequences of random operators that address the fundamental tradeoffs identified in our paper in order to maximize action-gap regularity for the suboptimal actions of each state. Our theoretical and empirical results further raise a related fundamental issue that concerns whether maximizing the action gap is sufficient to improve the performance of value-iteration algorithms in environments with approximation or estimation errors.** **

**References**

[1] A. Farahmand. Action-gap phenomenon in reinforcement learning. *Advances in Neural Information Processing Systems*, 24, 2011.

[2] M. G. Bellemare, G. Ostrovski, A. Guez, P. S. Thomas, and R. Munos. Increasing the action gap: New operators for reinforcement learning. In *Proc. Thirtieth AAAI Conference on Artificial* *Intelligence*, AAAI’16, pages 1476–1483. AAAI Press, 2016.

[3] Y. Lu, M. S. Squillante, C. W. Wu. A family of robust stochastic operators for reinforcement learning. *Proc. NeurIPS*, 2019.

*Special thanks to Thomas Jackman for his edits and assistance with this post.*

**Mark Squillante**

Distinguished Research Staff Member and Manager, Foundations of Probability and Dynamics, Math Sciences, IBM Research

**Yingdong Lu**

Research Staff Member

### We’ve moved! The IBM Research blog has a new home

In an effort better integrate the IBM Research blog with the IBM Research web experience, we have migrated to a new landing page: https://research.ibm.com/blog

### Pushing the boundaries of human-AI interaction at IUI 2021

At the 2021 virtual edition of the ACM International Conference on Intelligent User Interfaces (IUI), researchers at IBM will present five full papers, two workshop papers, and two demos.

### From HPC Consortium’s success to National Strategic Computing Reserve

Founded in March 2020 just as the pandemic’s wave was starting to wash over the world, the Consortium has brought together 43 members with supercomputing resources. Private and public enterprises, academia, government and technology companies, many of whom are typically rivals. “It is simply unprecedented,” said Dario Gil, Senior Vice President and Director of IBM Research, one of the founding organizations. “The outcomes we’ve achieved, the lessons we’ve learned, and the next steps we have to pursue are all the result of the collective efforts of these Consortium’s community.” The next step? Creating the National Strategic Computing Reserve to help the world be better prepared for future global emergencies.