SM3 is a memory-efficient adaptive optimization method used in machine learning. It reduces the memory overhead of the optimizer, allowing for larger models and batch sizes, while retaining the benefits of standard per-parameter adaptivity. This combination has made it a popular choice in modern machine learning.

Why traditional methods don't work for large-scale applications

Standard adaptive gradient-based optimizers, such as AdaGrad and Adam, tune the learning rate for each parameter during optimization using cumulative second-order statistics. They offer superior convergence properties and are attractive in large-scale applications because their time and space requirements are linear in the number of parameters. However, recent advances in natural language processing rely on models with very large numbers of parameters (ranging from 10^8 to 10^10) trained with adaptive optimization methods, and at that scale even a memory footprint that is linear in the number of parameters becomes a serious constraint.

In such cases, the memory overhead of the optimizer can restrict the size of the model that can be used as well as the batch size, both of which can have a dramatic effect on the quality of the final model.

How SM3 works

SM3 reduces the memory overhead of the optimizer by maintaining second-order statistics for sets of parameters, called cover sets, rather than for each parameter individually, and the method is general enough to work with arbitrary cover sets. In practice, cover sets are chosen so that the parameters in each set have second-order statistics of similar magnitude.

Observations have shown that in standard neural networks, certain entries of the stochastic gradients have (on average) similar values, and exhibit what we refer to as an activation pattern. In embedding layers of deep networks, an entire row (or column) is either zero or non-zero, while in intermediate layers gradients associated with the same unit are of similar order of magnitude. A similar phenomenon is observed in the second-order statistics maintained by adaptive methods.

For parameters of deep networks that are organized as a collection of tensors, cover sets consisting of slices of codimension one for each tensor are formed. Thus, for an m × n parameter matrix, the cover consists of the rows and columns of the matrix, and the memory requirement drops from m × n to merely m + n. For a parameter tensor of rank p with dimensions n1 × ... × np, the reduction in memory consumption is even more pronounced, dropping from the product of all the dimensions (n1 · ... · np) to their sum (n1 + ... + np). This virtually eliminates the memory overhead associated with maintaining the adaptive learning rates.
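
The snippet below is a minimal NumPy sketch of this idea for a single matrix parameter, not the reference implementation; the function name `sm3_matrix_step` and all constants are illustrative. Each entry's second-order statistic is approximated by the minimum of its row and column accumulators, so only m + n values are stored.

```python
import numpy as np

def sm3_matrix_step(w, grad, row_acc, col_acc, lr=0.1, eps=1e-30):
    """One SM3-style update for an m x n parameter matrix.

    Instead of an m x n accumulator of squared gradients, only a row
    accumulator of size m and a column accumulator of size n are kept
    (the row/column cover of the matrix).
    """
    # Per-entry estimate of the accumulated second-order statistic:
    # the minimum over the cover sets (row and column) containing it.
    nu = np.minimum(row_acc[:, None], col_acc[None, :]) + grad ** 2

    # AdaGrad-style step using the approximate accumulator.
    w -= lr * grad / np.sqrt(nu + eps)

    # Fold the new statistics back into the compact accumulators
    # by taking maxima over the entries each accumulator covers.
    row_acc[:] = nu.max(axis=1)
    col_acc[:] = nu.max(axis=0)
    return w

# Memory for the accumulators is m + n floats instead of m * n.
m, n = 1000, 500
w = np.zeros((m, n))
row_acc, col_acc = np.zeros(m), np.zeros(n)
grad = np.random.randn(m, n)   # stand-in for a stochastic gradient
w = sm3_matrix_step(w, grad, row_acc, col_acc)
```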

Using SM3 on your model

When using SM3 on your model, consider the following techniques:

Learning rate warm-up:

The formula used for learning rate warm-up is `learning_rate = lr_constant * tf.minimum(1.0, (global_step / warm_up_step) ** p)`. `p` can be set to 1 or 2. Linear ramp-up of the learning rate is obtained with `p=1`, while quadratic ramp-up is obtained with `p=2` (preferred).

`warm_up_step` is usually set to about 5% of the overall number of training steps. Initially, the norm of the preconditioned gradient is much larger than the norm of the weights, and learning rate warm-up is a heuristic for correcting this scale mismatch.
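
A short sketch of this schedule, with illustrative names and defaults (plain Python in place of the TensorFlow ops above):

```python
def warmup_learning_rate(global_step, total_steps, lr_constant=0.1, p=2):
    """Polynomial learning rate warm-up as described above.

    warm_up_step is taken to be 5% of the overall training steps; with
    p=2 the learning rate ramps up quadratically until it reaches
    lr_constant, after which it stays constant.
    """
    warm_up_step = max(1, int(0.05 * total_steps))
    return lr_constant * min(1.0, (global_step / warm_up_step) ** p)

# Example: a 100,000-step run reaches the full learning rate at step 5,000.
for step in [0, 1000, 2500, 5000, 50000]:
    print(step, warmup_learning_rate(step, total_steps=100_000))
```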

Learning rate decay:

SM3 uses the accumulated squared gradients to decay the step size. Each coordinate gets its own natural decay based on the scale of its gradients over time, so users need not supply an external learning rate decay schedule. Moreover, experiments with translation and language models have shown that this approach is superior to hand-tuned learning rate decay schedules, which are typically combined with exponential moving averages of the squared gradients.
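
The toy snippet below (an assumed setup, not part of SM3 itself) illustrates this natural decay: the effective per-coordinate step size `lr / sqrt(accumulated squared gradients)` shrinks automatically as statistics accumulate, and coordinates with larger gradients decay faster.

```python
import numpy as np

lr = 0.1
accumulator = np.zeros(3)               # one entry per coordinate
for step in range(1, 6):
    grad = np.array([1.0, 0.1, 0.01])   # coordinates with different gradient scales
    accumulator += grad ** 2
    effective_step = lr / np.sqrt(accumulator)
    print(step, effective_step)         # larger-gradient coordinates decay faster
```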

Polyak averaging of parameters:

It is useful to run Polyak averaging of the parameters and use the averaged parameters for inference/serving. Using the averaged parameters instead of the last iterate typically improves the overall performance of the model. An alternative to Polyak averaging that does not require extra memory is to decay the learning rate from its constant value to zero over the last 10% of the steps of the training run. This phase is termed the cool-down phase: as training takes smaller and smaller steps, the final iterate can be thought of as an average iterate.
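
Both options are sketched below under assumed names and defaults: `update_average` keeps an exponential moving average of the parameters for serving, and `cooldown_learning_rate` is the memory-free alternative.

```python
import numpy as np

def update_average(avg_params, params, decay=0.999):
    """Polyak-style averaging: a slowly moving copy of the parameters."""
    return decay * avg_params + (1.0 - decay) * params

def cooldown_learning_rate(global_step, total_steps, lr_constant=0.1):
    """Keep the learning rate constant, then decay it linearly to zero
    over the final 10% of training steps (the cool-down phase)."""
    cooldown_start = int(0.9 * total_steps)
    if global_step < cooldown_start:
        return lr_constant
    return lr_constant * (total_steps - global_step) / (total_steps - cooldown_start)

params = np.random.randn(4)
avg_params = params.copy()
for step in range(100):
    params = params - 0.01 * np.random.randn(4)   # stand-in for a training update
    avg_params = update_average(avg_params, params)
# avg_params (rather than the last iterate) would be used for serving.
```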

Overall, using SM3 on your model can greatly improve its performance and allow for larger models and batch sizes. By reducing the memory overhead of the optimizer, SM3 provides a practical, efficient, and general approach to training deep neural networks.
