Proximal policy optimization (PPO) is a deep reinforcement learning algorithm for training models and agents to improve their performance. The policy in PPO describes how an agent, such as a robot or a program, has learned to act in the world. This approach has revolutionized reinforcement learning and is a key part of how agents and systems can learn as they interact with users and with the world. Modern large language models (like derivatives of OpenAI's ChatGPT) use PPO for reinforcement learning from human feedback (RLHF). PPO is also one of the most common algorithms for training agents in video games, robotic process automation and self-driving cars.
PPO was first introduced by John Schulman, Filip Wolski, et al.1 in a paper titled "Proximal Policy Optimization Algorithms." To understand how PPO works and why it is important, we should begin with reinforcement learning (RL). The RL process uses machine learning to help a system pick an action based on its environment and a goal. Humans make these types of decisions all the time; for instance, in a game of blackjack we decide whether to “hit” (take a new card) or “stay” (keep the cards that we already have). Reinforcement learning has four fundamental elements: agent, environment, actions and rewards. In the example of playing blackjack, we have the following (a code sketch of this loop follows the list):
Agent: the human playing the game.
Environment: the cards the player has and the dealer's card that the player can see
Actions: whether to “hit” (get a new card) or “stay” (keep the current cards)
Rewards: whether the player wins or loses the hand
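To make these four elements concrete, here is a minimal sketch of the interaction loop. It assumes the Blackjack-v1 environment from the open-source Gymnasium toolkit and an agent that simply acts at random (learning would replace the random choice); neither is prescribed by this article.

```python
import gymnasium as gym

# Environment: the player's card total and the dealer's visible card.
env = gym.make("Blackjack-v1")
obs, info = env.reset()              # obs = (player_sum, dealer_card, usable_ace)

done = False
while not done:
    # Agent and actions: this placeholder agent picks 0 ("stay") or 1 ("hit") at random.
    action = env.action_space.sample()
    obs, reward, terminated, truncated, info = env.step(action)
    done = terminated or truncated

# Reward: roughly +1 for winning the hand, -1 for losing and 0 for a draw.
print("final reward:", reward)
```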
A policy is a strategy or set of rules that an agent, like a robot or a software program, follows to select an action based on its environment. The policy gives a mapping from environments to the actions that the agent should take. Policies can be simple, with a fixed action for each state, or they can be complex, incorporating estimations, calculations and learning mechanisms to determine the optimal action. The goal is to find a policy that maximizes the cumulative reward the agent receives over time. You can imagine a simple policy for blackjack being “hit if your card total is below 17, stay at 17 or above.” The agent would compare its environment to the policy and select an action accordingly.
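That simple rule can be written as a tiny policy function, a direct mapping from an observed state to an action. Representing the state as just the player's card total is a simplifying assumption for this example.

```python
def simple_blackjack_policy(player_total: int) -> str:
    """A fixed policy: map the observed state (the player's card total) to an action."""
    return "hit" if player_total < 17 else "stay"

# The agent consults the policy for each state it observes.
for total in (12, 16, 17, 20):
    print(total, "->", simple_blackjack_policy(total))
```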
RL is a way to help an agent learn a policy that will help it decide which actions to take to maximize the rewards. PPO is a powerful technique to help agents learn more effectively. It has been used in applications from building robots to improving how large language models (LLMs) reason and respond to prompts. PPO is one of the fundamental parts of reinforcement learning from human feedback (RLHF), which is used to train LLMs. It's this latter application that has driven much of the recent research into RL.
There are two broad families of RL methods. The first is value-based methods, which evaluate the environment and determine the best action for that environment based on the expected reward. All possible actions are evaluated and the agent selects the action with the highest expected reward. To determine the best action, the agent focuses on learning a value function that estimates the expected cumulative reward (return) from each environment state or state–action pair. The policy is derived indirectly: the agent selects whichever action maximizes the estimated value. The idea is that discovering an optimal value function for each possible action leads to an optimal policy. When there is a limited number of state–action pairs to explore, this approach can be highly effective. The agent calculates the expected reward for each action and chooses based on that estimate. With large action spaces, where the number of possible actions is high, it becomes difficult to estimate a return for every possible action. A robot deciding how to rotate an arm with multiple joints in 3D space faces thousands of possible decisions, far too many to calculate a value for every combination of joint positions. The approach also restricts the agent to predefined actions, meaning that the agent is unable to come up with new or novel strategies.
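As a toy illustration of value-based selection (the states, actions and numbers below are invented for the example, not taken from the article), an agent can store an estimated return for every state–action pair and act greedily on it:

```python
# Estimated return (value) for each state-action pair in a tiny, made-up blackjack table.
q_values = {
    ("total_16", "hit"): -0.15,
    ("total_16", "stay"): -0.30,
    ("total_19", "hit"): -0.70,
    ("total_19", "stay"): 0.40,
}

def greedy_action(state, actions=("hit", "stay")):
    """Value-based control: choose the action with the highest estimated return."""
    return max(actions, key=lambda action: q_values[(state, action)])

print(greedy_action("total_16"))  # "hit"  because -0.15 > -0.30
print(greedy_action("total_19"))  # "stay" because  0.40 > -0.70
```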
This limitation leads to a second approach: policy-based methods. Instead of deriving behavior from estimated values, these methods work directly on the policy itself, the mapping from environments to actions described earlier, and search for the policy that maximizes the cumulative reward the agent receives over time. Imagine a model that is trying to decide when to buy, sell or hold specific stocks. A policy would define when the agent should consider buying or selling a stock depending on the trend of that stock or the behavior of the overall stock market.
A policy-based learning method learns which actions to prefer in a given state without directly estimating the expected outcome of each one. This approach means that the agent can consider many more possible actions for each state and doesn't need to estimate a value for every one of them. The agent optimizes the policy by adjusting the likelihood of choosing each action so as to maximize the expected return, without computing a separate value estimate per action. This process requires more data and a more complex learning architecture because the number of possible actions for a given state can be extremely large or even infinite. Policy gradient methods belong to this second category.
There are two fundamental approaches to learning a policy for an agent. On-policy methods use only the agent's current actions to drive learning: learning from what you are doing. Imagine a self-driving car that is trying to find the optimal route to a destination over the course of several trips. In on-policy learning, the car would learn only from the routes that it takes itself.
The policy directs the agent's actions in every environment, including the decision-making process while it is learning. The agent evaluates the outcomes of its present actions, refining its strategy incrementally. This method allows the agent to adapt and improve its decision-making by directly engaging with the environment and learning from its own real-time interactions.
Off-policy methods would have the car observe the routes taken by other self-driving cars and learn from their actions. The car doesn't have to follow the same policy as the cars it's observing, but it can observe what rewards they receive from their actions and update its own policy accordingly. Off-policy learning involves learning the value of the optimal policy independently of the agent's own actions. These methods enable the agent to learn about the optimal policy from observations, even when it's not following that policy. This approach is useful for learning from a fixed dataset or from a teaching policy.
The policy gradient is an on-policy approach that directly optimizes the policy itself by following the gradient of expected return with respect to the policy's parameters. Conceptually it is similar to stochastic gradient descent (SGD), which tries to minimize the error of a prediction with a loss function. There is a key difference, though: SGD typically estimates parameters to minimize a loss, whereas a policy gradient estimates the parameters of the policy distribution to prioritize actions that maximize the reward. The goal is to create a probability distribution over all possible actions that makes actions that will give greater rewards more likely and actions that will give lesser rewards less likely.
Many policy gradient algorithms are what are called actor–critic methods. The actor is a policy network that selects actions, and the critic is a value network that estimates how much reward the chosen actions are expected to earn in their states.
If your policy is $\pi_\theta(a \mid s)$ (a probability distribution over actions given a state), and the agent's goal is to maximize an expected return denoted by $J(\theta)$, then it would take gradient ascent steps defined as:

$$\theta_{k+1} = \theta_k + \alpha \nabla_\theta J(\theta_k)$$

In this equation, $\theta$ is the vector of policy parameters that determines the probability distribution over actions. The policy is updated by setting the new parameters $\theta_{k+1}$ equal to the previous parameters $\theta_k$ plus the learning rate $\alpha$ times the policy gradient. The policy gradient $\nabla_\theta J(\theta_k)$ is the gradient, with respect to $\theta$, of an objective function $J$ that represents the expected return from the policy.
A policy gradient method learns by applying the current policy, selecting an action, evaluating the reward and then computing the gradient of the objective function. The algorithm then applies the policy update, and on the next iteration it takes another action and observes the advantage gained by the update.
At the core of the policy gradient approach is the policy gradient theorem, which shows how a policy can be optimized. The gradient is simply how much the probability of an action is increased or decreased by the policy, multiplied by the reward realized by that action. The equation for the policy gradient theorem is given as:

$$\nabla_\theta J(\theta) = \mathbb{E}_{\pi_\theta}\big[\nabla_\theta \log \pi_\theta(a \mid s)\, R(s, a)\big]$$

This equation states that the gradient of the objective $J(\theta)$ with respect to the parameters $\theta$ equals an expectation $\mathbb{E}_{\pi_\theta}$ taken under the policy. Inside that expectation is the score function $\nabla_\theta \log \pi_\theta(a \mid s)$, the gradient of the log-probability of the chosen action, multiplied by $R(s, a)$, the reward from the agent taking action $a$ in state $s$.
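As a rough, minimal sketch of this update in PyTorch (not the reference implementation from the paper), the loss below is the negative of $\log \pi_\theta(a \mid s)$ times the observed return (standing in for $R(s, a)$), so that minimizing it performs gradient ascent on the expected return; the network shape and state size are assumptions for the example.

```python
import torch
import torch.nn as nn

# A small policy network: maps a 4-dimensional state to probabilities over 2 actions.
policy = nn.Sequential(nn.Linear(4, 32), nn.Tanh(), nn.Linear(32, 2), nn.Softmax(dim=-1))
optimizer = torch.optim.Adam(policy.parameters(), lr=1e-3)

def policy_gradient_step(states, actions, returns):
    """One gradient ascent step on E[log pi(a|s) * R(s, a)]."""
    dist = torch.distributions.Categorical(policy(states))
    log_probs = dist.log_prob(actions)              # log pi_theta(a_t | s_t)
    loss = -(log_probs * returns).mean()            # negate: minimizing -J ascends J
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```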
A policy gradient allows an agent to explore many variations of its behavior and quickly and efficiently compute which ones will give the greatest reward. This technique is not without flaws, though. For instance, policy-gradient methods often converge to a local maximum instead of the global optimum. They can also take longer to train than other kinds of RL methods. And the policy gradient estimate itself can have high variance, meaning the estimate of the gradient is highly inaccurate. This variance can lead the process to misestimate the direction of the gradient and how much policy changes affect it.
A classic example of a policy gradient algorithm is trust region policy optimization (TRPO)2. It was a major breakthrough when it was first introduced, but its constrained optimization step is computationally expensive and complicated to implement, which makes it hard to use in practice.
PPO is an improvement on learning with a plain policy gradient. Fundamentally, it looks for the largest improvement step that can be taken on a policy by using the current data without making such a large policy update that performance collapses. In policy-gradient methods, the policy is improved by nudging its parameters in the direction that increases expected reward. However, large updates can destroy performance: a single overly aggressive gradient step can change agent behavior dramatically and collapse learning. Earlier algorithms (like TRPO) solved this issue by enforcing a constraint on how far the new policy can move from the old one, but they were mathematically complex.
PPO keeps policy updates close to the previous policy while avoiding the need for complex constraints. Instead of enforcing a hard constraint on each update, PPO uses what is called a clipped surrogate objective that stops the policy from making large jumps. The policy still improves by following the gradient of the reward as the agent acts and observes the outcomes, but it does so safely.
With a policy gradient, each iteration takes a gradient ascent step on the policy objective function. The step size presents a challenge: if the step is too small, training will be slow, but if it is too large, there will be too much variability in the policy for the agent to find an optimal solution.
In PPO, the idea is to constrain policy updates with a clipped surrogate objective function that keeps the policy change within a specific range. This function is given as:

$$L^{CLIP}(\theta) = \hat{\mathbb{E}}_t\Big[\min\big(r_t(\theta)\hat{A}_t,\ \text{clip}(r_t(\theta),\, 1-\epsilon,\, 1+\epsilon)\hat{A}_t\big)\Big]$$

To break this formula down, the first part is the ratio function $r_t(\theta)$. That ratio function is:

$$r_t(\theta) = \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{\text{old}}}(a_t \mid s_t)}$$

This is the probability of taking action $a_t$ in state $s_t$ under the current policy divided by the probability under the previous one; it calculates the probability ratio of the current and old policy. If $r_t(\theta) > 1$, then the action $a_t$ in state $s_t$ is more likely under the current policy than under the old policy. If the ratio is less than 1, then the action is less likely under the current policy than under the old one.

The $\min$ and $\text{clip}$ terms show how PPO clips the objective, using a surrogate constraint that penalizes changes that move the ratio away from 1. As the algorithm tries out changes in the policy, those changes are kept small by the clipping range set by the epsilon value $\epsilon$, a small hyperparameter (often around 0.2). Here $\hat{A}_t$ is the estimated advantage of taking the action, discussed further below.
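To make the clipping concrete, here is a minimal PyTorch sketch of the clipped surrogate loss; the tensor names and the default $\epsilon = 0.2$ are illustrative assumptions rather than the paper's reference code.

```python
import torch

def ppo_clip_loss(new_log_probs, old_log_probs, advantages, epsilon=0.2):
    """Clipped surrogate objective, negated so a standard optimizer can minimize it."""
    # Probability ratio r_t(theta), computed in log space for numerical stability.
    ratio = torch.exp(new_log_probs - old_log_probs)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - epsilon, 1.0 + epsilon) * advantages
    # Take the pessimistic (minimum) term for each step, then average over the batch.
    return -torch.min(unclipped, clipped).mean()
```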
The primary advantage of this clipped surrogate approach is that it keeps changes small and is computationally efficient. Older methods like TRPO constrain the policy update with a KL-divergence constraint computed alongside the objective function. That approach can be effective at constraining updates but requires a complex implementation and longer computation time. PPO builds the clipped probability ratio directly into the objective function, speeding up each training epoch.
Another advantage that PPO has over many other approaches is greater sample efficiency; that is, each environment interaction leads to greater policy improvement. PPO achieves this efficiency by splitting rollout data into minibatches, so the same interaction data can be reused over multiple gradient updates. Other approaches, like deep Q-networks, can be even more sample-efficient but are more computationally intensive and thus not preferred when there is adequate data.
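A rough sketch of that reuse follows; the batch, minibatch and epoch sizes are illustrative assumptions rather than values from the article. The same rollout is shuffled and revisited over several epochs of updates.

```python
import numpy as np

def minibatch_epochs(batch_size, minibatch_size, num_epochs, seed=0):
    """Yield shuffled minibatch index sets so one rollout is reused for several epochs."""
    rng = np.random.default_rng(seed)
    for _ in range(num_epochs):
        order = rng.permutation(batch_size)          # reshuffle the same rollout data
        for start in range(0, batch_size, minibatch_size):
            yield order[start:start + minibatch_size]

# Example: a rollout of 2,048 steps reused for 10 epochs of 64-step minibatches.
for indices in minibatch_epochs(batch_size=2048, minibatch_size=64, num_epochs=10):
    pass  # each 'indices' would select the states, actions and advantages for one update
```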
A crucial part of updating the policy is estimating just how much advantage the new policy has over previous iterations. This step is more computationally complex than simply calculating the difference in rewards between two policies at a given step. Agents need to account for how an action shifts the rewards they receive across multiple future steps. Even in a relatively simple game of blackjack, the reward for an action the agent takes won't be realized until several rounds of play later. This accounting becomes even more important in a game like chess or in a scenario involving robotic navigation.
To learn these updates, PPO uses what is called generalized advantage estimation (GAE). This step helps determine how well the policy is working and how far it should move away from the current policy to improve. GAE computes advantages as an exponentially weighted sum of future temporal-difference errors, giving PPO a learning signal that trades off bias against variance and keeps policy updates stable and sample-efficient.
Stopping too early when accumulating actual rewards introduces high bias, because only a small portion of the true return is observed and the rest must be estimated. Accumulating rewards over too many steps leads to high variance, because relying on a long sequence of real samples makes the estimate unstable. GAE's exponential weighting balances these two extremes.
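A minimal sketch of how such an advantage estimate might be computed from a rollout is shown below, assuming per-step lists of rewards, value estimates and episode-termination flags; the discount factor gamma and smoothing parameter lambda shown are common defaults, not values taken from the article.

```python
def compute_gae(rewards, values, dones, last_value, gamma=0.99, lam=0.95):
    """Generalized advantage estimation: exponentially weighted sum of TD errors."""
    advantages = [0.0] * len(rewards)
    gae = 0.0
    for t in reversed(range(len(rewards))):
        next_value = last_value if t == len(rewards) - 1 else values[t + 1]
        next_non_terminal = 1.0 - dones[t]
        # One-step temporal-difference error: delta_t = r_t + gamma * V(s_{t+1}) - V(s_t)
        delta = rewards[t] + gamma * next_value * next_non_terminal - values[t]
        # Accumulate backward in time: A_t = delta_t + gamma * lambda * A_{t+1}
        gae = delta + gamma * lam * next_non_terminal * gae
        advantages[t] = gae
    return advantages
```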
Libraries like Stable-Baselines3 and RLlib provide full-featured implementations of PPO that can be applied to a wide range of domains and problems. There are also lighter-weight implementations like CleanRL, which include tutorials for self-teaching, are written in frameworks such as PyTorch and can be found on GitHub.
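For instance, a few lines with Stable-Baselines3 are enough to train a PPO agent. This sketch assumes the Gymnasium CartPole-v1 environment and the library's default hyperparameters, which this article does not prescribe.

```python
import gymnasium as gym
from stable_baselines3 import PPO

# Create a simple control environment and train a PPO agent on it.
env = gym.make("CartPole-v1")
model = PPO("MlpPolicy", env, verbose=1)
model.learn(total_timesteps=100_000)

# Use the learned policy: the model predicts an action for each observation.
obs, _ = env.reset()
for _ in range(1000):
    action, _ = model.predict(obs, deterministic=True)
    obs, reward, terminated, truncated, _ = env.step(int(action))
    if terminated or truncated:
        obs, _ = env.reset()
```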
Libraries like transformer reinforcement learning (TRL) from Hugging Face are optimized specifically to help train transformer language models by using RL algorithms like PPO. In this context, PPO helps ensure that models learn to select responses that are better aligned with the model creator's goals and with human feedback.
1. Schulman, J., Wolski, F., Dhariwal, P., Radford, A., & Klimov, O. (2017). Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347.
2. Schulman, J., Levine, S., Abbeel, P., Jordan, M., & Moritz, P. (2015, June). Trust region policy optimization. In International conference on machine learning (pp. 1889-1897). PMLR.