What is AdaMax?

AdaMax is an optimization algorithm that builds on Adam, which stands for Adaptive Moment Estimation. Adam is a popular optimization algorithm used to train the weights of deep learning models efficiently. AdaMax generalizes Adam's update rule from the $l_2$ norm to the $l_\infty$ norm. But what does that mean?

Understanding the $l_2$ norm and $l_\infty$ norm

Before we dive into AdaMax, let's first examine the $l_2$ norm and $l_\infty$ norm.

The $l_2$ norm is a mathematical formula used to measure the magnitude of a vector. It is also known as the Euclidean norm. The formula for $l_2$ norm is:

$$\left\lVert \boldsymbol{x} \right\rVert_2 = \sqrt{\sum_{i=1}^{n} x_i^2}$$

Here, $\boldsymbol{x}$ is a vector of $n$ dimensions. The $l_2$ norm measures the length of the vector by summing the squares of all its elements and taking the square root of that sum.
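As a quick illustration, here is a minimal NumPy sketch (the vector values are just an example):

```python
import numpy as np

x = np.array([3.0, -4.0, 0.0])

# l2 norm: square each element, sum, then take the square root.
l2 = np.sqrt(np.sum(x ** 2))                 # 5.0
print(l2, np.linalg.norm(x, ord=2))          # matches NumPy's built-in norm
```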

The $l_\infty$ norm, on the other hand, is also known as the maximum norm or Chebyshev norm. The formula for $l_\infty$ norm is:

$$\left\lVert \boldsymbol{x} \right\rVert_\infty = \max_{i}|x_i|$$

The $l_\infty$ norm calculates the magnitude of a vector by finding the element with the maximum absolute value.
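The same vector makes the difference clear (again a minimal NumPy sketch with example values):

```python
import numpy as np

x = np.array([3.0, -4.0, 0.0])

# l_inf norm: the largest absolute value in the vector.
linf = np.max(np.abs(x))                     # 4.0
print(linf, np.linalg.norm(x, ord=np.inf))   # matches NumPy's built-in norm
```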

How AdaMax works

Now that we understand $l_2$ norm and $l_\infty$ norm, we can talk about how AdaMax works.

AdaMax uses the $l_\infty$ norm in place of the $l_2$ norm in the Adam update equation. The Adam update equation is:

$$\theta_{t+1} = \theta_{t} - \frac{\eta}{\sqrt{\hat{v}_{t}} + \epsilon}\hat{m}_{t}$$

Here, $\theta_t$ represents the weights at time $t$, $\eta$ is the learning rate, $\epsilon$ is a small value added to the denominator to prevent division by zero, $\hat{m}_{t}$ is the exponentially weighted moving average of the gradients, and $\hat{v}_{t}$ is the exponentially weighted moving average of the squared gradients.
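To make the moving parts concrete, here is a minimal sketch of a single Adam step in NumPy. The function name and default hyperparameter values are illustrative, not a library implementation:

```python
import numpy as np

def adam_step(theta, g, m, v, t, eta=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update; g is the gradient at step t (t starts at 1)."""
    m = beta1 * m + (1 - beta1) * g          # moving average of the gradients
    v = beta2 * v + (1 - beta2) * g ** 2     # moving average of the squared gradients
    m_hat = m / (1 - beta1 ** t)             # bias-corrected first moment
    v_hat = v / (1 - beta2 ** t)             # bias-corrected second moment
    theta = theta - eta * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v
```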

The AdaMax formula replaces the $\sqrt{\hat{v}_{t}} + \epsilon$ term with $u_t$:

$$\theta_{t+1} = \theta_{t} - \frac{\eta}{u_{t}}\hat{m}_{t}$$

Here, $u_t$ is computed as:

$$u_{t} = \max(\beta_{2}\cdot u_{t-1}, |g_{t}|)$$

Here, $\beta_{2}$ is a decay rate, $u_{t-1}$ is the value of this accumulator at the previous step, and $g_{t}$ is the gradient at time $t$.

The max operation makes $u_t$ an exponentially weighted $l_\infty$ norm of the past gradients, rather than the $l_2$-style average of squared gradients used in Adam. This makes AdaMax useful when very large gradients would otherwise cause learning instability in Adam.
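Putting the pieces together, a single AdaMax step might look like this. This is a minimal NumPy sketch following the equations above; the function name is illustrative, and the small constant in the denominator is a safeguard many implementations add rather than part of the update as written:

```python
import numpy as np

def adamax_step(theta, g, m, u, t, eta=0.002, beta1=0.9, beta2=0.999):
    """One AdaMax update; g is the gradient at step t (t starts at 1)."""
    m = beta1 * m + (1 - beta1) * g        # moving average of the gradients
    u = np.maximum(beta2 * u, np.abs(g))   # infinity-norm accumulator u_t
    m_hat = m / (1 - beta1 ** t)           # bias-corrected first moment
    theta = theta - eta * m_hat / (u + 1e-8)  # guard against u_t = 0
    return theta, m, u
```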

Default values in AdaMax

Common default values used in AdaMax are $\eta = 0.002$, $\beta_1 = 0.9$, and $\beta_2 = 0.999$. The learning rate $\eta$ determines how fast the weights are updated, and the decay rates $\beta_1$ and $\beta_2$ control the exponential weighting of the moving averages. These defaults have been found to work well in many deep learning models, but they are not optimal for every use case; tuning them is often necessary to achieve good performance on a particular model.
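For example, in PyTorch (assuming that framework is available; other frameworks expose a similar optimizer), AdaMax is provided as torch.optim.Adamax and ships with these same defaults, which you can override when tuning:

```python
import torch

model = torch.nn.Linear(10, 1)  # any model's parameters work here
optimizer = torch.optim.Adamax(model.parameters(),
                               lr=0.002,            # eta
                               betas=(0.9, 0.999))  # beta_1, beta_2
```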

In summary, AdaMax is a generalization of Adam that uses the $l_\infty$ norm instead of the $l_2$ norm to measure the magnitude of the gradients. This makes AdaMax more stable when dealing with large gradients, though it still requires some parameter tuning to perform well across different deep learning models. Once tuned, however, AdaMax can provide a fast and efficient way to train deep learning models.
