RL_Paper_04: PPO

Proximal Policy Optimization Algorithms

Paper Source
Journal: arXiv preprint (arXiv:1707.06347)
Year: 2017
Institute: OpenAI
Authors: John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, Oleg Klimov
#Deep Reinforcement Learning #Policy Gradient

PPO (Proximal Policy Optimization) is a policy gradient method that is much simpler to implement than TRPO and has better sample complexity. It optimizes a novel objective with clipped probability ratios, which forms a pessimistic estimate (a lower bound) of the policy's performance ($\theta_{old}$ is the vector of policy parameters before the update).

In vanilla policy gradient methods, the loss function is

$$L^{PG}(\theta) = \hat{\mathbb{E}}_t\!\left[\log \pi_{\theta}(a_t \mid s_t)\, \hat{A}_t\right],$$

where $\pi_{\theta}$ is a stochastic policy, $\hat{A}_t$ is an estimator of the advantage function at timestep $t$, and $\hat{\mathbb{E}}_t[\cdot]$ denotes the empirical average over a finite batch of samples.
While it is appealing to perform multiple steps of optimization on this loss $L^{PG}$ using the same trajectory, doing so is not well-justified, and empirically it often leads to destructively large policy updates.
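As a rough illustration, a minimal PyTorch sketch of this surrogate is given below; the tensor names (`log_probs`, `advantages`) are placeholders, not from the paper.

```python
import torch

def pg_loss(log_probs: torch.Tensor, advantages: torch.Tensor) -> torch.Tensor:
    """Vanilla policy-gradient surrogate: L^PG = E_t[log pi_theta(a_t|s_t) * A_hat_t].

    Returned negated so that minimizing it with a standard optimizer
    performs gradient ascent on the objective.
    """
    # Advantage estimates are treated as fixed targets, not differentiated through.
    return -(log_probs * advantages.detach()).mean()
```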

In TRPO, a surrogate objective is maximized subject to a constraint on the size of the policy update:

$$\underset{\theta}{\text{maximize}}\;\hat{\mathbb{E}}_t\!\left[\frac{\pi_{\theta}(a_t \mid s_t)}{\pi_{\theta_{old}}(a_t \mid s_t)}\hat{A}_t\right] \quad \text{subject to} \quad \hat{\mathbb{E}}_t\!\left[\mathrm{KL}\!\left[\pi_{\theta_{old}}(\cdot \mid s_t),\, \pi_{\theta}(\cdot \mid s_t)\right]\right] \le \delta.$$

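TRPO itself enforces the KL term as a hard constraint (solved with conjugate gradient and a line search), which does not reduce to a few lines; the sketch below shows only the KL-penalized variant of the same surrogate, with placeholder tensors (`log_probs`, `old_log_probs`, `kl`) and an illustrative penalty coefficient `beta`.

```python
import torch

def kl_penalized_surrogate(log_probs: torch.Tensor,
                           old_log_probs: torch.Tensor,
                           advantages: torch.Tensor,
                           kl: torch.Tensor,
                           beta: float = 1.0) -> torch.Tensor:
    """Ratio surrogate E_t[r_t(theta) * A_hat_t] with a KL penalty instead of a hard constraint."""
    ratio = torch.exp(log_probs - old_log_probs)      # r_t(theta) = pi_theta / pi_theta_old
    surrogate = (ratio * advantages.detach()).mean()
    return -(surrogate - beta * kl.mean())            # negated for use with a minimizer
```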
Now, in PPO, the loss function becomes the clipped surrogate objective

$$L^{CLIP}(\theta) = \hat{\mathbb{E}}_t\!\left[\min\!\left(r_t(\theta)\hat{A}_t,\; \mathrm{clip}\!\left(r_t(\theta),\, 1-\epsilon,\, 1+\epsilon\right)\hat{A}_t\right)\right],$$

where $r_t(\theta) = \frac{\pi_{\theta}(a_t \mid s_t)}{\pi_{\theta_{old}}(a_t \mid s_t)}$ is the probability ratio and $\epsilon$ is a hyperparameter (e.g., $\epsilon = 0.2$). Taking the minimum makes the objective a lower bound (pessimistic estimate) on the unclipped surrogate, so moving $r_t(\theta)$ outside the interval $[1-\epsilon,\, 1+\epsilon]$ yields no extra benefit, which discourages destructively large policy updates.

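A minimal sketch of the clipped surrogate in PyTorch; the default `clip_eps=0.2` follows the value mentioned above, and the tensor names are assumptions.

```python
import torch

def ppo_clip_loss(log_probs: torch.Tensor,
                  old_log_probs: torch.Tensor,
                  advantages: torch.Tensor,
                  clip_eps: float = 0.2) -> torch.Tensor:
    """Clipped surrogate: L^CLIP = E_t[min(r_t * A_t, clip(r_t, 1-eps, 1+eps) * A_t)]."""
    ratio = torch.exp(log_probs - old_log_probs)                        # r_t(theta)
    adv = advantages.detach()
    unclipped = ratio * adv
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * adv
    return -torch.min(unclipped, clipped).mean()                        # pessimistic bound, negated
```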
This objective can further be augmented with a value-function error term (useful when policy and value function share parameters) and an entropy bonus to ensure sufficient exploration, as suggested in past work. The final objective is

$$L_t^{CLIP+VF+S}(\theta) = \hat{\mathbb{E}}_t\!\left[L_t^{CLIP}(\theta) - c_1 L_t^{VF}(\theta) + c_2 S[\pi_{\theta}](s_t)\right],$$

where $c_1$, $c_2$ are coefficients, $S$ denotes an entropy bonus, and $L_t^{VF}$ is a squared-error loss $(V_{\theta}(s_t) - V_t^{target})^2$.
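A sketch of the combined objective as a single loss to minimize; the coefficient defaults (`c1=0.5`, `c2=0.01`) are illustrative only, since the paper tunes $c_1$ and $c_2$ per benchmark.

```python
import torch
import torch.nn.functional as F

def ppo_total_loss(log_probs, old_log_probs, advantages,
                   values, value_targets, entropy,
                   clip_eps=0.2, c1=0.5, c2=0.01):
    """Combined loss: -(L^CLIP - c1 * L^VF + c2 * S), minimized with SGD/Adam."""
    ratio = torch.exp(log_probs - old_log_probs)                        # r_t(theta)
    adv = advantages.detach()
    clip_obj = torch.min(ratio * adv,
                         torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * adv).mean()
    value_loss = F.mse_loss(values, value_targets)                      # (V_theta(s_t) - V_t^target)^2
    return -clip_obj + c1 * value_loss - c2 * entropy.mean()            # entropy bonus encourages exploration
```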

The full algorithm (PPO, actor-critic style) alternates between sampling and optimization: in each iteration, each of $N$ parallel actors runs policy $\pi_{\theta_{old}}$ in the environment for $T$ timesteps and computes advantage estimates $\hat{A}_1, \dots, \hat{A}_T$; the surrogate $L$ is then optimized with respect to $\theta$ for $K$ epochs of minibatch SGD (minibatch size $M \le NT$), and finally $\theta_{old} \leftarrow \theta$.
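A compressed sketch of one such iteration in PyTorch, assuming the rollout data (states, actions, old log-probabilities, advantage estimates, value targets) have already been collected and flattened into tensors; the network and hyperparameter names are illustrative, not taken from the paper's implementation.

```python
import torch
from torch.distributions import Categorical

def ppo_iteration(policy_net, value_net, optimizer, rollout,
                  epochs=4, minibatch_size=64, clip_eps=0.2, c1=0.5, c2=0.01):
    """One PPO iteration: K epochs of minibatch SGD on the clipped objective."""
    states, actions, old_log_probs, advantages, value_targets = rollout
    n = states.shape[0]                                   # NT timesteps in total
    for _ in range(epochs):                               # K epochs over the same rollout
        for idx in torch.randperm(n).split(minibatch_size):
            dist = Categorical(logits=policy_net(states[idx]))
            log_probs = dist.log_prob(actions[idx])
            ratio = torch.exp(log_probs - old_log_probs[idx])
            adv = advantages[idx]
            clip_obj = torch.min(
                ratio * adv,
                torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * adv).mean()
            values = value_net(states[idx]).squeeze(-1)
            value_loss = (values - value_targets[idx]).pow(2).mean()
            loss = -clip_obj + c1 * value_loss - c2 * dist.entropy().mean()
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
    # theta_old <- theta: the next rollout is collected with the updated policy,
    # and its log-probabilities become the new old_log_probs.
```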