The Asynchronous Advantage Actor-Critic (A3C) algorithm is a policy gradient method used in reinforcement learning. It maintains a policy $\pi\left(a\_{t}\mid{s}\_{t}; \theta\right)$ and an estimate of the value function $V\left(s\_{t}; \theta\_{v}\right)$, and learns both in order to solve a given task.

How A3C Works

A3C operates in the forward view and uses a mix of $n$-step returns to update both the policy and the value function. Updates are performed either after every $t\_{\text{max}}$ actions or when a terminal state is reached. Each policy update follows the gradient of the logarithm of the policy, scaled by an estimate of the advantage function. The advantage function measures how much better taking a particular action in a given state is than taking an average action in that state.
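
Written out, the policy parameters $\theta$ are updated in the direction of:

$$\nabla\_{\theta}\log\pi\left(a\_{t}\mid{s}\_{t}; \theta\right)A\left(s\_{t}, a\_{t}; \theta, \theta\_{v}\right)$$

while the value-function parameters $\theta\_{v}$ are updated to reduce the squared error between the $n$-step return and $V\left(s\_{t}; \theta\_{v}\right)$.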

The estimate of the advantage function, denoted $A\left(s\_{t}, a\_{t}; \theta, \theta\_{v}\right)$, is computed as:

$$\sum^{k-1}\_{i=0}\gamma^{i}r\_{t+i} + \gamma^{k}V\left(s\_{t+k}; \theta\_{v}\right) - V\left(s\_{t}; \theta\_{v}\right)$$

where $k$ is the number of steps used in the return; it varies from state to state and is upper-bounded by $t\_{\text{max}}$. The critic in A3C learns the value function while multiple actors train in parallel, each periodically synchronizing with the global parameters.
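
As a concrete illustration, the following sketch (the function and variable names are illustrative, not taken from the original implementation) computes these advantage estimates for one rollout segment by accumulating the discounted return backwards from the bootstrap value $V\left(s\_{t+k}; \theta\_{v}\right)$:

```python
import numpy as np

def n_step_advantages(rewards, values, bootstrap_value, gamma=0.99):
    """Compute the n-step advantage estimates for one rollout segment.

    rewards:         rewards r_t collected over the segment (length <= t_max)
    values:          value estimates V(s_t) for the visited states (same length)
    bootstrap_value: V(s_{t+k}) for the state after the segment, or 0.0 if terminal
    """
    advantages = np.zeros(len(rewards))
    # R starts at the bootstrap value and is rolled backwards through the segment,
    # so position i effectively uses a (len(rewards) - i)-step return.
    R = bootstrap_value
    for i in reversed(range(len(rewards))):
        R = rewards[i] + gamma * R
        advantages[i] = R - values[i]
    return advantages

# Example: a 3-step segment that did not end in a terminal state.
adv = n_step_advantages(rewards=[1.0, 0.0, 1.0],
                        values=[0.5, 0.4, 0.6],
                        bootstrap_value=0.7)
print(adv)
```

Because the return is rolled backwards, the first state in the segment receives the largest $k$ while the last state receives a one-step return, which is why $k$ varies from state to state.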

For stability, gradients are accumulated over each rollout rather than applied one step at a time. The overall scheme resembles parallelized stochastic gradient descent: many copies of the actor and critic run in parallel, each computing its own gradients from its own experience, and these gradients are applied asynchronously to the shared global parameters of the actor and critic.
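
A minimal sketch of this asynchronous update, assuming a PyTorch-style setup (the `TinyNet` module, the `worker` function, and the hyperparameters are placeholders rather than the paper's implementation, and processes are used instead of threads, as is common in Python): each worker synchronizes its local copy with the global network, accumulates gradients over a rollout, copies them onto the global parameters, and steps the optimizer without locking.

```python
import torch
import torch.nn as nn
import torch.multiprocessing as mp

class TinyNet(nn.Module):
    """Stand-in for the actor-critic network described below."""
    def __init__(self):
        super().__init__()
        self.fc = nn.Linear(4, 2)

    def forward(self, x):
        return self.fc(x)

def worker(global_net):
    local_net = TinyNet()
    # Each worker keeps its own optimizer statistics but updates the shared parameters.
    optimizer = torch.optim.RMSprop(global_net.parameters(), lr=1e-3)
    for _ in range(10):  # each iteration stands in for one t_max-step rollout
        # Synchronize the local copy with the current global parameters.
        local_net.load_state_dict(global_net.state_dict())

        # Placeholder loss; in A3C this would be the policy-gradient term plus
        # the value-function loss accumulated over the rollout.
        local_net.zero_grad()
        loss = local_net(torch.randn(8, 4)).pow(2).mean()
        loss.backward()

        # Copy the locally accumulated gradients onto the shared global model
        # and apply them asynchronously (no locking between workers).
        for gp, lp in zip(global_net.parameters(), local_net.parameters()):
            gp.grad = lp.grad
        optimizer.step()

if __name__ == "__main__":
    global_net = TinyNet()
    global_net.share_memory()  # place the global parameters in shared memory
    processes = [mp.Process(target=worker, args=(global_net,)) for _ in range(4)]
    for p in processes:
        p.start()
    for p in processes:
        p.join()
```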

Parameter Sharing

In practice, we share some of the parameters between the policy and the value function. Typically, we use a convolutional neural network with one softmax output for the policy $\pi\left(a\_{t}\mid{s}\_{t}; \theta\right)$ and one linear output for the value function $V\left(s\_{t}; \theta\_{v}\right)$. All non-output layers are shared between the policy and the value function. This allows A3C to learn efficiently across a wide range of tasks.
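
A sketch of such a network, assuming a PyTorch implementation with an 84×84 single-channel input (the layer sizes loosely follow the Atari setup described in the paper, but the exact dimensions here are assumptions):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ActorCriticNet(nn.Module):
    """Shared convolutional trunk with a softmax policy head and a linear value head."""

    def __init__(self, in_channels, num_actions):
        super().__init__()
        # Non-output layers, shared between the policy and the value function.
        self.conv1 = nn.Conv2d(in_channels, 16, kernel_size=8, stride=4)
        self.conv2 = nn.Conv2d(16, 32, kernel_size=4, stride=2)
        self.fc = nn.Linear(32 * 9 * 9, 256)
        # Separate output layers.
        self.policy_head = nn.Linear(256, num_actions)
        self.value_head = nn.Linear(256, 1)

    def forward(self, x):
        x = F.relu(self.conv1(x))
        x = F.relu(self.conv2(x))
        x = F.relu(self.fc(x.flatten(start_dim=1)))
        policy = F.softmax(self.policy_head(x), dim=-1)  # pi(a_t | s_t; theta)
        value = self.value_head(x)                       # V(s_t; theta_v)
        return policy, value

# Example: 84x84 grayscale observation, 6 discrete actions.
net = ActorCriticNet(in_channels=1, num_actions=6)
probs, value = net(torch.zeros(1, 1, 84, 84))
```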

Advantages of A3C

One of the key advantages of A3C is its ability to scale to larger problems by exploiting parallel computation: because many actor-learner threads run simultaneously, the wall-clock time needed to train a model drops substantially. A3C is also comparatively stable, since running multiple actors in parallel decorrelates the data they generate, while the shared architecture lets the policy and value function learn common features across large and complex problems.

Overall, A3C is a powerful algorithm that has seen widespread use across many applications in machine learning, especially within the realm of robotic manipulation and control of dynamic systems. Its ability to learn policies effectively and efficiently has made it a popular choice in the research community as it continues to perform well in a variety of domains.
