Stein Variational Policy Gradient

Stein Variational Policy Gradient (SVPG) Overview

Stein Variational Policy Gradient (SVPG) is a policy gradient-based method used in reinforcement learning to simultaneously exploit and explore multiple policies. Instead of learning a single policy, SVPG models a distribution of policy parameters.

Traditional Policy Optimization vs. SVPG

Traditional policy optimization learns a single policy for decision-making. It works by estimating how changes to the policy's parameters affect the expected reward and then updating the parameters in the direction that increases that reward. This approach can be successful, but it has limitations. For example, a single policy can get trapped in a suboptimal solution and fail to discover better ones.

In contrast, SVPG works by modeling a distribution of policy parameters, which allows for the exploration of multiple policies. By optimizing this distribution of policy parameters, SVPG encourages exploration of the parameter space while also optimizing the expected utility of policies drawn from this distribution.

Entropy Regularization and SVGD

SVPG optimizes the distribution of policy parameters with entropy regularization. Entropy, in this context, measures the randomness or spread of that distribution: a high-entropy distribution covers many different parameter settings, while a low-entropy one concentrates on a few. By adding an entropy term to the objective, SVPG encourages exploration and diversity in the policy parameter space, while still exploiting reward information so that the distribution converges towards good solutions.
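
Concretely, the entropy-regularized objective can be written as follows (the notation anticipates the update equation below: J(θ) denotes the expected return of the policy with parameters θ, and q_0 is a prior over parameters):

$$\max_{q}\; \mathbb{E}_{\theta \sim q}\big[J(\theta)\big] - \alpha\, D_{\mathrm{KL}}\big(q \,\|\, q_0\big)$$

The maximizer of this objective is proportional to q_0(θ)·exp(J(θ)/α); with a flat prior, the KL term reduces to an entropy bonus. The role of SVGD, described next, is to approximate this target distribution with a finite set of particles.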

Stein variational gradient descent (SVGD) is used to optimize this distribution. SVGD is a method that uses deterministic dynamics to transport a set of particles to approximate a given target posterior distribution.
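
To make the mechanics concrete, here is a minimal NumPy sketch of one SVGD step, assuming an RBF kernel with the common median-bandwidth heuristic; the function names (`rbf_kernel`, `svgd_step`) and the step size are illustrative choices, not part of the method's specification:

```python
import numpy as np

def rbf_kernel(particles, bandwidth=None):
    """RBF kernel matrix and its gradient with respect to the first argument."""
    # diffs[j, i] = theta_j - theta_i, shape (n, n, d)
    diffs = particles[:, None, :] - particles[None, :, :]
    sq_dists = np.sum(diffs ** 2, axis=-1)
    if bandwidth is None:
        # Median heuristic commonly paired with SVGD.
        bandwidth = np.median(sq_dists) / np.log(len(particles) + 1) + 1e-8
    K = np.exp(-sq_dists / bandwidth)                       # K[j, i] = k(theta_j, theta_i)
    grad_K = (-2.0 / bandwidth) * diffs * K[..., None]      # d k(theta_j, theta_i) / d theta_j
    return K, grad_K

def svgd_step(particles, grad_log_p, step_size=1e-2):
    """One SVGD update: deterministically transport particles toward the target p.

    particles:   (n, d) array, one parameter vector per particle
    grad_log_p:  (n, d) array, gradient of log p evaluated at each particle
    """
    n = len(particles)
    K, grad_K = rbf_kernel(particles)
    # Driving term (kernel-weighted gradients of log p) plus repulsive term (kernel gradients).
    phi = (K.T @ grad_log_p + grad_K.sum(axis=0)) / n
    return particles + step_size * phi
```

The repulsive term `grad_K.sum(axis=0)` is what keeps the particles from collapsing onto a single mode of the target distribution.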

The Update Equation

The SVPG update for each particle takes the following form:

$$\nabla\theta_i = \frac{1}{n}\sum_{j=1}^{n} \left[ \nabla_{\theta_j} \left(\frac{1}{\alpha} J(\theta_j) + \log q_0(\theta_j)\right) k(\theta_j, \theta_i) + \nabla_{\theta_j} k(\theta_j, \theta_i) \right]$$

Here each θ_i is a particle, i.e. a policy parameter vector, and k(·, ·) is a positive definite kernel such as an RBF kernel. The magnitude of α adjusts the relative importance of the policy gradient and the prior term, while the kernel-gradient term at the end of the sum acts as a repulsive force that diversifies the particles and enables parameter exploration. A suitable α provides a good trade-off between exploration and exploitation.

If α is too large, the policy gradient term is negligible and the particles remain consistent only with the prior distribution. As α approaches 0, the algorithm reduces to running n copies of independent policy gradient algorithms, provided the particles are initialized differently. An annealing scheme for α allows efficient exploration at the beginning of training and shifts the focus to exploitation towards the end.
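
Putting the pieces together, the sketch below shows one SVPG update; it reuses `svgd_step` from the earlier SVGD sketch and assumes the target distribution is proportional to q_0(θ)·exp(J(θ)/α), with the per-particle policy gradient estimates and prior gradients supplied by the caller:

```python
def svpg_step(particles, policy_grads, grad_log_prior, alpha, step_size=1e-3):
    """One SVPG particle update (illustrative; reuses svgd_step from the SVGD sketch).

    particles:       (n, d) array, one policy parameter vector per particle
    policy_grads:    (n, d) array of policy gradient estimates of J(theta_j),
                     e.g. REINFORCE or advantage-based estimates for each particle's policy
    grad_log_prior:  (n, d) array of grad log q0(theta_j); zeros for a flat prior
    alpha:           temperature trading off exploration (large) and exploitation (small)
    """
    # The target density is proportional to q0(theta) * exp(J(theta) / alpha),
    # so its log-gradient at theta_j is grad J(theta_j) / alpha + grad log q0(theta_j).
    grad_log_target = policy_grads / alpha + grad_log_prior
    return svgd_step(particles, grad_log_target, step_size=step_size)
```

An annealing schedule can be as simple as decaying alpha by a constant factor between calls, so that early updates are dominated by the prior and repulsive terms (exploration) and later updates by the policy gradients (exploitation).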

SVPG is a powerful method for reinforcement learning that allows for the exploration of multiple policies simultaneously. By modeling a distribution of policy parameters and optimizing with entropy regularization and SVGD, SVPG encourages exploration and diversity in the policy parameter space while also maintaining some exploitation to converge towards an optimal solution.
