Proximal Policy Optimization

Overview of Proximal Policy Optimization (PPO)

Proximal Policy Optimization (PPO) is a policy gradient method for reinforcement learning. PPO was created to combine efficient data usage with reliable performance while using only first-order optimization. It does this by clipping the objective to penalize policy changes that move the probability ratio away from one, which yields a pessimistic lower bound on the unclipped objective. In this article, we will explain PPO in more detail and how it works.

How PPO Works

PPO works by maximizing a “surrogate” objective that is a lower bound on the unclipped objective. The surrogate is computed from the probability ratio $r\_{t}\left(\theta\right)$ between the new and old policies, defined below.

The surrogate objective maximized in TRPO comes from conservative policy iteration (CPI) and is defined as:

$$ L^{\text{CPI}}\left({\theta}\right) = \hat{\mathbb{E}}\_{t}\left[\frac{\pi\_{\theta}\left(a\_{t}\mid{s\_{t}}\right)}{\pi\_{\theta\_{\text{old}}}\left(a\_{t}\mid{s\_{t}}\right)}\hat{A}\_{t}\right] $$

Where $\pi\_{\theta}\left(a\_{t}\mid{s\_{t}}\right)$ is the probability that the current policy assigns to action $a\_{t}$ in state $s\_{t}$, $\pi\_{\theta\_{\text{old}}}$ is the old policy that collected the data, and $\hat{A}\_{t}$ is an estimator of the advantage at timestep $t$. The probability ratio is $r\_{t}\left(\theta\right) = \pi\_{\theta}\left(a\_{t}\mid{s\_{t}}\right) / \pi\_{\theta\_{\text{old}}}\left(a\_{t}\mid{s\_{t}}\right)$, so the objective can be written compactly as $L^{\text{CPI}}\left(\theta\right) = \hat{\mathbb{E}}\_{t}\left[r\_{t}\left(\theta\right)\hat{A}\_{t}\right]$.
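
As a rough sketch in PyTorch-style code (the function name and arguments here are illustrative, assuming we already have per-timestep log-probabilities under the current and old policies plus advantage estimates), the CPI surrogate can be computed from log-probabilities:

```python
import torch

def cpi_surrogate(log_probs_new, log_probs_old, advantages):
    # r_t(theta) = pi_theta(a_t|s_t) / pi_theta_old(a_t|s_t), computed as the
    # exponential of the log-probability difference for numerical stability.
    ratio = torch.exp(log_probs_new - log_probs_old)
    # L^CPI is the empirical mean of ratio * advantage over the batch.
    return (ratio * advantages).mean()
```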

Maximizing $L^{\text{CPI}}$ without a constraint would lead to excessively large policy updates, so the objective needs to be modified to penalize changes that move $r\_{t}\left(\theta\right)$ away from 1. This is where PPO comes in.

Here is the equation for the PPO objective function:

$$ L^{\text{CLIP}}\left({\theta}\right) = \hat{\mathbb{E}}\_{t}\left[\min\left(r\_{t}\left(\theta\right)\hat{A}\_{t}, \text{clip}\left(r\_{t}\left(\theta\right), 1-\epsilon, 1+\epsilon\right)\hat{A}\_{t}\right)\right] $$

Where $\epsilon$ is a hyperparameter, typically set to a small value such as 0.2. The $\text{clip}$ function inside the $\min$ removes the incentive for moving the probability ratio outside of the $\left[1-\epsilon, 1+\epsilon\right]$ interval. Taking the minimum of the clipped and unclipped terms makes the final objective a lower bound on the unclipped objective: a change in the probability ratio is ignored when it would improve the objective, and included when it would make the objective worse.
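
Continuing the same illustrative sketch, the clipped objective only adds a clamped copy of the ratio and an element-wise minimum:

```python
import torch

def ppo_clip_objective(log_probs_new, log_probs_old, advantages, epsilon=0.2):
    ratio = torch.exp(log_probs_new - log_probs_old)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - epsilon, 1.0 + epsilon) * advantages
    # The element-wise minimum makes this a lower bound on the unclipped
    # surrogate: ratio changes are ignored only when they would improve the
    # objective, never when they would make it worse.
    return torch.min(unclipped, clipped).mean()
```

For example, with $\epsilon = 0.2$, a positive advantage, and a ratio of 1.5, the clipped term caps the contribution at $1.2\,\hat{A}\_{t}$, so there is no incentive to push the ratio any further.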

It is important to note that when the actor and critic share network parameters, the objective is augmented with a value-function error term (so the shared network also learns accurate value estimates) and an entropy bonus that encourages exploration.
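
A hedged sketch of that combined loss (the coefficients `c1` and `c2` are illustrative weights on the value error and entropy bonus, and `values`, `returns`, and `entropy` are assumed to come from the shared network and the rollout):

```python
import torch
import torch.nn.functional as F

def ppo_total_loss(log_probs_new, log_probs_old, advantages,
                   values, returns, entropy,
                   epsilon=0.2, c1=0.5, c2=0.01):
    # Clipped policy objective (to be maximized, hence the leading minus below).
    ratio = torch.exp(log_probs_new - log_probs_old)
    policy_obj = torch.min(
        ratio * advantages,
        torch.clamp(ratio, 1.0 - epsilon, 1.0 + epsilon) * advantages,
    ).mean()
    # Squared-error loss on the value head of the shared network.
    value_loss = F.mse_loss(values, returns)
    # Entropy bonus encourages exploration; subtracted because we minimize.
    return -policy_obj + c1 * value_loss - c2 * entropy.mean()
```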

Advantages of Using PPO

One advantage of PPO is that it often matches or outperforms other policy gradient methods while remaining robust and data-efficient in complex, continuous environments. Because the clipped objective keeps each update close to the old policy, it acts as a soft trust region and leads to more stable training than unconstrained policy gradient updates.

Another advantage of PPO is that it is computationally cheaper than TRPO. PPO uses only first-order optimization, whereas TRPO relies on second-order (natural-gradient) computations that are expensive. First-order updates are cheap, and each batch of collected data can be reused for several epochs of minibatch stochastic gradient updates, as sketched below.
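
As a minimal sketch of this first-order update scheme (the `policy` and `rollout` interfaces here are hypothetical, and `ppo_total_loss` is the combined loss sketched above), each rollout is reused for several epochs of minibatch Adam updates:

```python
import torch

def ppo_update(policy, optimizer, rollout, num_epochs=10, minibatch_size=64):
    # One PPO update phase: repeated first-order (Adam) minibatch steps on a
    # single rollout. `policy` and `rollout` are hypothetical interfaces.
    for _ in range(num_epochs):
        for batch in rollout.minibatches(minibatch_size):
            loss = ppo_total_loss(
                policy.log_prob(batch.states, batch.actions),
                batch.old_log_probs,
                batch.advantages,
                policy.value(batch.states),
                batch.returns,
                policy.entropy(batch.states),
            )
            optimizer.zero_grad()
            loss.backward()   # first-order gradients only; no Hessian as in TRPO
            optimizer.step()
```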

Disadvantages of Using PPO

While PPO provides a more stable learning process, its hyperparameters can be difficult to tune. The clipping parameter $\epsilon$ has a strong effect on performance and must be chosen carefully, and the usual neural-network hyperparameters (learning rate, architecture, and so on) also need tuning, so a fair amount of trial and error is involved.

Proximal Policy Optimization (PPO) is a reliable and efficient policy gradient method for reinforcement learning. It maximizes a clipped surrogate objective, a lower bound on the unclipped objective that penalizes updates which move the probability ratio far from one. This yields good performance, low computational cost, and a stable learning process compared to many other methods, although tuning its hyperparameters can take some effort.
