AggMo, or Aggregated Momentum, is a variant of the classical momentum stochastic optimizer. It is designed to resolve the problem of choosing a single momentum parameter, which simplifies the optimization of deep learning models.

What is Momentum in Deep Learning Optimization?

Momentum is a technique used in deep learning optimization that accumulates an exponentially decaying average of past gradients and uses it to update the model's parameters. By carrying a velocity across iterations, it speeds up movement along directions where gradients consistently agree and damps oscillations, allowing faster convergence and smoother optimization. The momentum coefficient is considered one of the important hyperparameters in deep learning algorithms, and selecting the right value is crucial for better optimization results.
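As a rough sketch, the classical momentum update can be written in a few lines of NumPy; the names momentum_step, lr, and beta below are illustrative choices, not part of any particular library:

```python
import numpy as np

def momentum_step(theta, velocity, grad, lr=0.1, beta=0.9):
    # Accumulate an exponentially decaying average of past gradients.
    velocity = beta * velocity - grad
    # Move the parameters along the accumulated velocity.
    return theta + lr * velocity, velocity
```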

Understanding AggMo

AggMo is an attempt to solve the problem of selecting the right momentum value. Instead of a single velocity vector, it maintains several velocity vectors with different discount factors $\beta$, and their average is used to update the model's parameters.

How Does AggMo Work?

The AggMo optimizer maintains $K$ velocity vectors, each with its own value of the discount factor $\beta$. These $K$ momentum buffers are combined to update the model's parameters. The discount factor $\beta$ is a hyperparameter that controls how much of the previous velocity is carried over into the current step. It lies between 0 and 1, with higher values giving older gradients more influence on the current velocity.

The AggMo update rule first computes the velocity vector $v^{\left(i\right)}_{t}$ for each $i = 1, \ldots, K$ using the following expression:

$$v^{\left(i\right)}_{t} = \beta^{(i)}v^{\left(i\right)}_{t-1} - \nabla_{\theta}f\left(\mathbf{\theta}_{t-1}\right)$$

Here, $\nabla_{\theta}f\left(\mathbf{\theta}_{t-1}\right)$ is the gradient of the objective function with respect to the model's parameters $\mathbf{\theta}$, evaluated at the previous iterate. Each of the $K$ momentum buffers accumulates this same gradient with its own discount factor $\beta^{(i)}$, and the buffers are then combined to update the model's parameters $\mathbf{\theta}$ as follows:

$$ \mathbf{\theta}_{t} = \mathbf{\theta}_{t-1} + \frac{\gamma_{t}}{K}\sum_{i=1}^{K}v_{t}^{\left(i\right)} $$

The above update rule computes the next value of $\mathbf{\theta}$ by adding the average of the $K$ velocity vectors $v_{t}^{\left(i\right)}$, scaled by the learning rate, to the current parameter value $\mathbf{\theta}_{t-1}$. Here, $\gamma_{t}$ is the learning rate, which determines the step size for updating the model's parameters.
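Putting the two formulas together, a minimal NumPy sketch of one AggMo step might look like the following; the function name aggmo_step, the toy quadratic objective, and the damping values betas=(0.0, 0.9, 0.99) are assumptions made for illustration:

```python
import numpy as np

def aggmo_step(theta, velocities, grad, lr=0.1, betas=(0.0, 0.9, 0.99)):
    """One AggMo update with K = len(betas) velocity buffers."""
    K = len(betas)
    # Each buffer accumulates the same gradient with its own discount factor beta^(i).
    for i, beta in enumerate(betas):
        velocities[i] = beta * velocities[i] - grad
    # Average the K buffers and take a step of size lr (gamma_t in the text).
    theta = theta + (lr / K) * sum(velocities)
    return theta, velocities

# Toy usage: minimize f(theta) = 0.5 * ||theta||^2, whose gradient is theta itself.
theta = np.array([5.0, -3.0])
velocities = [np.zeros_like(theta) for _ in range(3)]
for _ in range(500):
    theta, velocities = aggmo_step(theta, velocities, grad=theta)
print(theta)  # approaches the minimum at the origin
```

Each buffer reacts to the same gradient but forgets it at a different rate; averaging them trades off the fast progress of large $\beta$ values against the stability of small ones.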

Advantages of AggMo

AggMo provides several advantages when optimizing deep learning models. Some of these include:

  • Robustness: AggMo is more robust to changes in its hyperparameters and requires less tuning than many other stochastic optimization methods.
  • Convergence Speed: Combining multiple momentum buffers lets AggMo converge faster, so it takes fewer iterations to optimize the network's parameters.
  • Accurate Optimization: By mixing buffers with different discount factors, AggMo preserves useful gradient directions while damping oscillations, which helps it settle accurately on good solutions.
  • Modest Memory Cost: AggMo only stores $K$ velocity buffers rather than a full history of past gradients, so its memory overhead stays small and grows linearly with $K$.

Aggregated Momentum, or AggMo, is an optimization technique that allows for faster convergence and more accurate optimization of deep learning models. By utilizing multiple momentum buffers, AggMo sidesteps the problem of selecting a single momentum parameter value while remaining robust to hyperparameter choices. This makes it a useful technique for training large-scale deep learning models used in various applications, including computer vision, natural language processing, and speech recognition.
