Rectified Adam, also known as RAdam, is a modification of the Adam stochastic optimizer that aims to fix the poor convergence Adam can exhibit. It does so by rectifying the variance of the adaptive learning rate.

The Problem with Adam

The authors of RAdam contend that the primary issue with Adam is the undesirably high variance of its adaptive learning rate in the early stages of training, when only a small number of gradient samples has been observed. This high variance often leads to poor convergence, and it is the motivation for creating RAdam.

The Solution with RAdam

RAdam addresses the problem by effectively using smaller update steps in the first few epochs of training, while too few gradients have been seen for the variance estimate to be reliable. The same analysis explains why the warmup heuristic, which lowers the learning rate by hand over the same early period, works in practice.

The following equations show the computation steps for adaptive learning rate, variance rectification, and parameter update:

$$g_t = \nabla_\theta f_t(\theta_{t-1})$$

$$v_t = \beta_2 v_{t-1} + (1-\beta_2)\, g_t^2$$

$$m_t = \beta_1 m_{t-1} + (1-\beta_1)\, g_t$$

$$\hat{m}_t = \frac{m_t}{1-\beta_1^t}$$

$$\rho_\infty = \frac{2}{1-\beta_2} - 1$$

$$\rho_t = \rho_\infty - \frac{2 t \beta_2^t}{1-\beta_2^t}$$

If the variance is tractable, i.e. $\rho_t > 4$, the adaptive learning rate is rectified and the parameters are updated with:

$$l_t = \sqrt{\frac{1-\beta_2^t}{v_t}}$$

$$r_t = \sqrt{\frac{(\rho_t - 4)(\rho_t - 2)\,\rho_\infty}{(\rho_\infty - 4)(\rho_\infty - 2)\,\rho_t}}$$

$$\theta_t = \theta_{t-1} - \alpha_t\, r_t\, \hat{m}_t\, l_t$$

If the variance is not tractable, we update instead with:

$$\theta_t = \theta_{t-1} - \alpha_t\, \hat{m}_t$$
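
To make these steps concrete, here is a minimal sketch of a single RAdam update for one parameter tensor, assuming NumPy. The function and variable names (radam_step, param, grad, m, v) are illustrative rather than taken from the paper, and the small eps added to the denominator is a common numerical-stability detail that is not part of the equations above.

```python
import numpy as np

def radam_step(param, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One RAdam update for a single parameter array; t counts steps from 1."""
    rho_inf = 2.0 / (1.0 - beta2) - 1.0

    # Exponential moving averages of the gradient and squared gradient (as in Adam).
    m = beta1 * m + (1.0 - beta1) * grad
    v = beta2 * v + (1.0 - beta2) * grad ** 2

    # Bias-corrected first moment.
    m_hat = m / (1.0 - beta1 ** t)

    # Length of the approximated SMA, which decides whether the variance is tractable.
    rho_t = rho_inf - 2.0 * t * beta2 ** t / (1.0 - beta2 ** t)

    if rho_t > 4.0:
        # Variance is tractable: rectify the adaptive learning rate.
        l_t = np.sqrt((1.0 - beta2 ** t) / (v + eps))  # eps avoids division by zero
        r_t = np.sqrt(((rho_t - 4.0) * (rho_t - 2.0) * rho_inf)
                      / ((rho_inf - 4.0) * (rho_inf - 2.0) * rho_t))
        param = param - lr * r_t * m_hat * l_t
    else:
        # Variance not tractable: fall back to an un-adapted momentum step.
        param = param - lr * m_hat
    return param, m, v

# Toy usage: a few steps on the quadratic loss 0.5 * ||theta||^2 (gradient = theta).
theta = np.array([1.0, -2.0])
m = np.zeros_like(theta)
v = np.zeros_like(theta)
for t in range(1, 11):
    grad = theta                     # gradient of the toy loss
    theta, m, v = radam_step(theta, grad, m, v, t)
```

With the default $\beta_2 = 0.999$, $\rho_t$ first exceeds 4 at step 5, so the sketch above begins with plain momentum updates and the rectification term $r_t$ then grows from small values toward 1, which is the warmup-like behaviour described earlier.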

Rectified Adam is a variation of the Adam stochastic optimizer. It is intended to solve the convergence problems caused by the high variance of Adam's adaptive learning rate in the early stages of training. It does so by rectifying that variance, which effectively applies smaller update steps until enough gradients have been seen for the estimate to become reliable. The result is more stable convergence and more accurate models.
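
As a hedged usage sketch: recent PyTorch releases ship an implementation of this optimizer as torch.optim.RAdam, and it is used like any other torch.optim optimizer. The toy model and random batch below are placeholders, not part of the original description.

```python
import torch
import torch.nn as nn

model = nn.Linear(10, 1)                                   # placeholder model
optimizer = torch.optim.RAdam(model.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()

x, y = torch.randn(32, 10), torch.randn(32, 1)             # dummy batch
for _ in range(5):
    optimizer.zero_grad()
    loss = loss_fn(model(x), y)
    loss.backward()
    optimizer.step()
```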
