Quasi-Hyperbolic Momentum (QHM) is a technique in stochastic optimization that improves on momentum SGD (Stochastic Gradient Descent). Instead of following either the plain SGD step or the momentum step alone, QHM takes a weighted average of the two at every update.

Understanding QHM

Before delving into QHM, it is necessary to understand what momentum SGD is. Momentum SGD is a popular optimization algorithm in machine learning that accelerates SGD by keeping a running, exponentially discounted average of past gradients and stepping along that average instead of the raw gradient. This smooths out oscillations and speeds up convergence.
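To make the baseline concrete, here is a minimal NumPy sketch of a single momentum SGD step. It uses the same normalized (exponentially averaged) buffer convention as the QHM equations given below; the function name, default learning rate, and default momentum value are illustrative choices rather than part of any particular library.

```python
import numpy as np

def momentum_sgd_step(theta, grad, buf, lr=0.1, beta=0.9):
    """One momentum SGD step: keep a discounted average of past gradients
    and move the parameters along that average instead of the raw gradient."""
    buf = beta * buf + (1.0 - beta) * grad  # update the momentum buffer
    theta = theta - lr * buf                # step along the momentum direction
    return theta, buf
```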

Now, QHM adds another degree of freedom on top of momentum SGD. At each step it still maintains the momentum buffer (the discounted average of past gradients), but the parameter update is a weighted average of the current gradient and that buffer, controlled by an additional parameter $\nu$, the immediate discount factor.

Mathematically, QHM can be represented as follows:

$$ g_{t+1} = \beta \, g_{t} + \left(1-\beta\right)\cdot\nabla\hat{L}_{t}\left(\theta_{t}\right) $$

$$ \theta_{t+1} = \theta_{t} - \alpha\left[\left(1-\nu\right)\cdot\nabla\hat{L}_{t}\left(\theta_{t}\right) + \nu\cdot g_{t+1}\right] $$

The update rule above has two parts. The first equation updates the momentum buffer $g$, an exponentially discounted average of past gradients with discount factor $\beta$. The second equation updates the parameters: with learning rate $\alpha$, the step taken is an average of the plain SGD step (the current gradient) and the momentum step (the buffer $g_{t+1}$), weighted by $\nu$. Setting $\nu = 0$ recovers plain SGD, while $\nu = 1$ recovers (normalized) momentum SGD.
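The update rule translates directly into code. The following is a minimal NumPy sketch of one QHM step under the equations above, together with a toy run on a simple quadratic; the function name, default hyperparameters, and the quadratic example are illustrative assumptions, not an official implementation.

```python
import numpy as np

def qhm_step(theta, grad, g, lr=0.1, beta=0.999, nu=0.7):
    """One QHM update following the two equations above."""
    g = beta * g + (1.0 - beta) * grad                 # momentum buffer g_{t+1}
    theta = theta - lr * ((1.0 - nu) * grad + nu * g)  # nu-weighted average of SGD and momentum steps
    return theta, g

# Toy usage: minimize f(theta) = 0.5 * ||theta||^2, whose gradient is theta itself.
theta = np.array([5.0, -3.0])
g = np.zeros_like(theta)
for _ in range(2000):
    theta, g = qhm_step(theta, grad=theta, g=g)
print(theta)  # after many steps, theta is close to the minimum at the origin
```

Note that setting `nu=0` in `qhm_step` reduces it to plain SGD, and `nu=1` reduces it to the momentum SGD step sketched earlier.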

The Importance of QHM

QHM is an important technique because it provides an efficient way to train deep learning models. Training deep learning models is difficult due to the huge number of parameters involved. By using QHM, it is possible to update these parameters efficiently and effectively.

Another benefit of QHM is that it can handle noisy gradients more effectively. In real-world scenarios, data and mini-batch gradients are noisy due to a number of factors; because QHM averages the noisy current gradient with the smoother momentum buffer, individual updates are less erratic and training can still achieve good results.

Optimizing QHM

One of the key aspects of QHM is to find the optimal values for the parameters $\beta$ and $\nu$. The authors suggest a rule of thumb of $\nu = 0.7$ and $\beta = 0.999$. However, these values may not be optimal for all situations.

To find the optimal values for these parameters, it is necessary to perform a grid search. This involves trying out different combinations of the parameters and determining which values produce the best results. This can be a time-consuming process, but it is essential for achieving optimal performance.
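Below is a hedged sketch of what such a grid search might look like in Python. The helper `train_and_evaluate` is hypothetical: it is assumed to train a model with the given QHM hyperparameters and return a validation loss, and the candidate value lists are arbitrary examples.

```python
import itertools

def grid_search_qhm(train_and_evaluate,
                    nus=(0.5, 0.7, 0.9, 1.0),
                    betas=(0.9, 0.99, 0.995, 0.999)):
    """Try every (nu, beta) pair and keep the one with the lowest validation loss.

    train_and_evaluate is a hypothetical callable that trains a model with the
    given QHM hyperparameters and returns a validation loss (lower is better).
    """
    best = None
    for nu, beta in itertools.product(nus, betas):
        val_loss = train_and_evaluate(nu=nu, beta=beta)
        if best is None or val_loss < best[0]:
            best = (val_loss, nu, beta)
    return best  # (best validation loss, best nu, best beta)
```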

Quasi-Hyperbolic Momentum (QHM) is an important technique in stochastic optimization. By averaging the momentum step with a plain SGD step, it helps to accelerate convergence and improve model training. QHM is particularly useful for training deep learning models with noisy data. With well-chosen parameter values, QHM can achieve impressive results and help to push the boundaries of machine learning research further.
