Prioritized Experience Replay

Prioritized Experience Replay in Reinforcement Learning

In recent years, reinforcement learning has become a popular area of research for developing agents that improve their performance through experience. One technique used in this field is experience replay, where previously observed transitions (state, action, reward, next state) are stored in a memory buffer and later sampled to train the agent, so that each experience can be reused many times.

One issue with standard experience replay is that it treats all experiences equally, regardless of how much the agent can learn from them. Prioritized experience replay addresses this problem by replaying more often the transitions with high expected learning progress, as measured by the magnitude of their temporal-difference (TD) error.

How Prioritized Experience Replay Works

The prioritization of experiences in prioritized experience replay is done with a stochastic sampling method that interpolates between pure greedy prioritization and uniform random sampling. The probability of sampling transition i is P(i) = p_i^alpha / sum_k p_k^alpha, where p_i > 0 is the priority of transition i and the exponent alpha determines how much prioritization is used.

The exponent alpha controls the degree of prioritization, with alpha = 0 corresponding to the uniform case; the higher the alpha value, the more strongly high-priority transitions dominate the sampled batches. By replaying experiences with higher expected learning progress more often, the agent can improve its performance more quickly from the same amount of collected data.
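As a rough sketch of this sampling step (not the efficient sum-tree or sorted-array structures used in practice), the probabilities and the draw could look as follows in Python; the function name and its arguments are illustrative rather than taken from any particular library:

```python
import numpy as np

def sample_transitions(priorities, batch_size, alpha=0.6, rng=None):
    """Sample transition indices with probability proportional to priority^alpha.

    alpha = 0 recovers uniform random sampling; larger alpha means
    stronger (greedier) prioritization.
    """
    rng = rng or np.random.default_rng()
    p = np.asarray(priorities, dtype=np.float64) ** alpha
    probs = p / p.sum()                     # P(i) = p_i^alpha / sum_k p_k^alpha
    indices = rng.choice(len(probs), size=batch_size, p=probs)
    return indices, probs[indices]

# Example: five stored transitions with different priorities.
indices, probs = sample_transitions([2.0, 0.1, 0.5, 1.5, 0.05], batch_size=3)
```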

Reducing Bias with Importance Sampling

Although prioritized experience replay can improve the performance of the agent, the non-uniform sampling changes the distribution that the updates are estimated from, which introduces bias into the learning process. To address this issue, importance-sampling (IS) weights are used: each sampled transition's update is down-weighted in proportion to how much more often it is replayed than it would be under uniform sampling.

Each IS weight is computed from the probability of sampling the transition, P(i), and the number of transitions in the buffer, N: w_i = (1/N * 1/P(i))^beta, normalized by the largest weight so that updates are only ever scaled down. The exponent beta is typically annealed from an initial value beta_0 towards 1 over the course of training, at which point the bias is fully corrected. The weights can be folded into the Q-learning update by using w_i * delta_i in place of delta_i, which helps the stability and scaling of the update.
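A minimal sketch of this correction, assuming the sampling probabilities returned by the example above and a hypothetical helper name:

```python
import numpy as np

def importance_weights(probs, buffer_size, beta=0.4):
    """Compute IS weights w_i = (N * P(i))^(-beta), normalized by the maximum.

    Normalizing by the largest weight keeps every weight <= 1, so the
    correction only scales gradient steps down, never up.
    """
    w = (buffer_size * np.asarray(probs, dtype=np.float64)) ** (-beta)
    return w / w.max()

# In the Q-learning update, each TD error delta_i is replaced by w_i * delta_i
# (equivalently, the per-sample loss is multiplied by w_i) before the gradient step.
```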

Two Types of Prioritization in Prioritized Experience Replay

There are two types of prioritization methods in prioritized experience replay:

  1. Proportional prioritization, where the priority is p_i = |delta_i| + epsilon, the magnitude of the TD error plus a small positive constant epsilon that guarantees every transition has a non-zero chance of being replayed.
  2. Rank-based prioritization, where the priority is p_i = 1 / rank(i), with rank(i) the rank of transition i when the replay memory is sorted according to |delta_i|.

The hyperparameters reported in the original paper are alpha = 0.7 and beta_0 = 0.5 for rank-based prioritization, and alpha = 0.6 and beta_0 = 0.4 for proportional prioritization, as sketched below.
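To make the two variants concrete, here is a short illustrative sketch; the function names are assumptions for this example and not from the original paper:

```python
import numpy as np

def proportional_priorities(td_errors, epsilon=1e-6):
    """Proportional variant: p_i = |delta_i| + epsilon (epsilon keeps p_i > 0)."""
    return np.abs(td_errors) + epsilon

def rank_based_priorities(td_errors):
    """Rank-based variant: p_i = 1 / rank(i).

    rank(i) is the position of transition i when the buffer is sorted by
    |delta_i| in descending order (rank 1 = largest TD error).
    """
    order = np.argsort(-np.abs(td_errors))   # indices sorted by decreasing |delta|
    ranks = np.empty_like(order)
    ranks[order] = np.arange(1, len(order) + 1)
    return 1.0 / ranks

# Example: three transitions with TD errors of different magnitudes.
td = np.array([0.5, -2.0, 0.1])
p_prop = proportional_priorities(td)   # [0.5 + eps, 2.0 + eps, 0.1 + eps]
p_rank = rank_based_priorities(td)     # [1/2, 1/1, 1/3]
```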

Prioritized experience replay is a useful technique in reinforcement learning: by replaying transitions with high expected learning progress more often, it can speed up learning and improve the performance of the agent. However, it is important to correct the bias introduced by the non-uniform sampling with importance-sampling weights, and the choice of hyperparameters for the two prioritization variants also influences how well the method works.
