April 27, 2018 | Written by: IBM Research Editorial Staff
Sequential decision-making usually involves planning, acting, and learning about temporally extended courses of action over different time scales. In the reinforcement learning (RL) framework, “options” (which represent pre-planned sequences of primitive actions) can accelerate learning and planning. But autonomously identifying good options, or “option discovery,” remains an open problem. “Eigenoptions” (which build upon representations of diffusive information flow in the environment) can facilitate exploration in RL problem domains. Our ICLR 2018 paper, Eigenoption Discovery through the Deep Successor Representation, by Marlos C. Machado, Clemens Rosenbaum, Xiaoxiao Guo, Miao Liu, Gerald Tesauro and Murray Campbell, extends the use of eigenoptions to settings with stochastic transitions and in which states are represented from raw sensory data. We demonstrate the promise of our approach using Atari games.
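For readers unfamiliar with the options framework, an option is formally a triple: an initiation set, an intra-option policy, and a termination condition (Sutton, Precup & Singh, 1999). The following minimal sketch is illustrative only; the names and the toy "go right" option are our own assumptions, not from the paper.

```python
from dataclasses import dataclass
from typing import Callable, Set

# An option in the options framework: where it may start, how it acts,
# and when it stops. States and actions are integers in this toy sketch.
@dataclass
class Option:
    initiation_set: Set[int]              # states where the option may start
    policy: Callable[[int], int]          # maps state -> primitive action
    termination: Callable[[int], float]   # probability of terminating in a state

# A toy option that always takes action 1 ("right") and terminates at state 3.
go_right = Option(
    initiation_set={0, 1, 2},
    policy=lambda s: 1,
    termination=lambda s: 1.0 if s == 3 else 0.0,
)
```

An agent that selects `go_right` follows its policy over several time steps, stopping stochastically according to `termination` — this is what makes options temporally extended compared with primitive actions.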
Eigenoptions for exploration
The notion of “eigenoptions” was introduced last year at ICML (Machado et al., 2017). It combines the options framework with spectral analysis of the “graph Laplacian” (a matrix encoding the environment’s states and state transitions), an approach first proposed for RL at ICML in 2005 (Mahadevan, 2005). Machado et al. showed that in the “tabular” case (where each state is a distinct node), the eigenvector analysis yields a diverse set of options that are useful for exploration in RL. They also showed that the eigenfunctions can be approximated effectively with linear function approximation over features instead of a tabular representation.
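To make the spectral analysis concrete, here is a minimal sketch (not the paper’s code) for a tiny five-state chain: build the graph Laplacian from the state-transition graph and eigendecompose it. In the eigenoption framework, each eigenvector defines an intrinsic reward (an “eigenpurpose”) whose optimal policy is one eigenoption; the smoothest eigenvectors capture the large-scale structure of the environment.

```python
import numpy as np

n = 5
A = np.zeros((n, n))
for i in range(n - 1):          # adjacency: states i and i+1 are connected
    A[i, i + 1] = A[i + 1, i] = 1
D = np.diag(A.sum(axis=1))      # degree matrix
L = D - A                       # combinatorial graph Laplacian

# eigh: L is symmetric, so eigenvalues come back real and sorted ascending.
eigvals, eigvecs = np.linalg.eigh(L)

# The smallest eigenvalue is 0 (constant eigenvector); each remaining
# column eigvecs[:, k] varies smoothly over the graph and can serve as
# an intrinsic reward over states.
```

In the tabular Four Rooms domain described below, the same computation runs over the room-connectivity graph instead of this toy chain.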
Exploiting the successor representation
Our ICLR paper extends that work with the following new contributions. First, our approach extends to environments with stochastic state transitions, whereas the previous computation of eigenoptions required deterministic (and symmetric) state transitions. Second, we go beyond linear function approximation over hand-crafted features to general nonlinear function approximation over raw sensory data. Both of these are achieved by making use of the “successor representation” (Dayan, 1993), which estimates, for a given starting state, the expected discounted future occupancy of every other state. Specifically, we learn the successor representation (SR) by using a novel algorithm to train a deep neural net on a large number of sample state transitions. Our approach is related to recent work by Kulkarni et al. (2016) on estimating SRs based on training auto-encoder representations of raw input pixels. Instead of auto-encoders, we employ a predictive architecture similar to that of Oh et al. (2015), where the next frame is predicted from the current frame and current action. This enables learning the effects of actions and what parts of the environment are under the agent’s control. Note that neither of these previous works adopts the options framework.
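In the tabular case the SR has a closed form that makes the connection concrete. Under a policy with transition matrix P and discount gamma, the SR is Psi = sum over t of (gamma P)^t = (I - gamma P)^{-1}; entry [s, s'] is the expected discounted number of future visits to s' starting from s. The sketch below (our own toy illustration, not the paper’s deep-learning algorithm, which learns the SR from samples) computes the SR for a random walk on a three-state chain and eigendecomposes it; the SR’s eigenvectors play the role the Laplacian eigenvectors play in the earlier tabular formulation.

```python
import numpy as np

gamma = 0.9
P = np.array([[0.5, 0.5, 0.0],   # random-walk transitions on a 3-state chain
              [0.5, 0.0, 0.5],
              [0.0, 0.5, 0.5]])

# Closed-form tabular successor representation.
Psi = np.linalg.inv(np.eye(3) - gamma * P)

# Eigenvectors of the SR, sorted by decreasing eigenvalue. The top
# eigenvector is constant (it pairs with P's eigenvalue 1, giving the
# SR eigenvalue 1 / (1 - gamma)) and carries no exploratory signal;
# the remaining eigenvectors define the eigenpurposes.
eigvals, eigvecs = np.linalg.eig(Psi)
order = np.argsort(-eigvals.real)
```

The advantage of estimating Psi directly, rather than building a Laplacian from observed transitions, is that sampled stochastic transitions are handled naturally, and a deep network can predict SR features from raw pixels when no tabular representation exists.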
Validating via Atari
We validated our methodology empirically with extensive experiments in two different domains. The first domain, “Four Rooms,” is a grid-world problem with a tabular representation of states and deterministic transitions between adjacent states. The second domain comprises the Atari games Bank Heist, Montezuma’s Revenge, and Ms. Pac-Man, chosen because they require disparate types of skills. Instead of a tabular state representation, we use the raw pixels of the Atari game display to represent the current game state. In both domains, we collect random samples of state transitions in order to estimate SRs, and then use the SRs to compute eigenoptions. In the Four Rooms domain, we obtained significant performance improvements over primitive actions, as shown in Figure 1.
Figure 1: (Left) Illustration of the “Four Rooms” tabular domain. (Center) We performed one set of experiments where training episodes started at state S1 and ended at state G1; learning curves for varying numbers of eigenoptions are plotted. (Right) A second set of experiments had episodes starting at state S2 and ending at G2; resulting learning curves are plotted. In both experiments, eigenoptions outperform primitive-action RL, and smaller numbers of eigenoptions (4, 8, 32) tend to be more effective than larger numbers (64, 128).
In the Atari games, depicted in Figure 2, we give a qualitative evaluation of the discovered eigenoptions. The figures show that each eigenoption spends almost all of its time at its terminal location, indicating that eigenoptions are indeed purposeful. We can also see that our algorithm pushes agents into corners and other relevant parts of the state space (particularly in Montezuma’s Revenge), corroborating the intuition that eigenoptions improve exploration.
Figure 2: Plots of the state-visitation density of several different learned eigenoptions in three Atari games: (Left) Bank Heist, (Center) Montezuma’s Revenge, and (Right) Ms. Pac-Man. In each game, each option is depicted by an avatar with a unique color, and the darkness of the color indicates the visitation density of the avatar location. Note that an eigenoption’s overwhelming mass of visitations corresponds to its terminal state.
Although options are an active area of current research, automatically discovering them from raw sensory data remains quite difficult. Our work provides a principled combination of novel conceptual approaches to tackle the challenge of option discovery in RL.
For further reading, an independent and more detailed blog post about our work may be found here: https://manantomar.github.io/2018/04/10/blogpost.html
References

Peter Dayan. Improving Generalization for Temporal Difference Learning: The Successor Representation. Neural Computation, 5(4):613–624, 1993.
Tejas D. Kulkarni, Ardavan Saeedi, Simanta Gautam, and Samuel J. Gershman. Deep Successor Reinforcement Learning. CoRR, abs/1606.02396, 2016.
Marlos C. Machado, Marc G. Bellemare, and Michael Bowling. A Laplacian Framework for Option Discovery in Reinforcement Learning. In Proc. of the International Conference on Machine Learning (ICML), pp. 2295–2304, 2017.
Sridhar Mahadevan. Proto-value Functions: Developmental Reinforcement Learning. In Proc. of the International Conference on Machine Learning (ICML), pp. 553–560, 2005.
Junhyuk Oh, Xiaoxiao Guo, Honglak Lee, Richard L. Lewis, and Satinder P. Singh. Action-Conditional Video Prediction using Deep Networks in Atari Games. In Advances in Neural Information Processing Systems (NIPS), pp. 2863–2871, 2015.