Shake-Shake Regularization: Improving Multi-Branch Network Generalization Ability

In machine learning, deep neural networks are widely used to solve complex problems. The convolutional neural network (CNN) is a popular type of deep neural network that is especially effective at image classification. One widely known CNN model is ResNet, short for residual network. ResNet is notable for its deep architecture, whose many layers can extract high-level features from an input image. However, deep networks are prone to overfitting: the model learns to fit the training data well but performs poorly on new data. To combat this problem, researchers introduced a technique called Shake-Shake regularization.

What is Shake-Shake Regularization?

Shake-Shake regularization is a technique that improves the generalization ability of multi-branch networks by replacing the standard summation of parallel branches with a stochastic affine combination. In simpler terms, when building a deep neural network with multiple branches, instead of adding the outputs of each branch together, Shake-Shake regularization combines them in a random way during training. This randomness adds noise to the model, preventing it from overfitting to the training data.

How Does Shake-Shake Regularization Work?

A typical ResNet with two residual branches follows this equation:

$$x\_{i+1} = x\_{i} + \mathcal{F}\left(x\_{i}, \mathcal{W}\_{i}^{\left(1\right)}\right) + \mathcal{F}\left(x\_{i}, \mathcal{W}\_{i}^{\left(2\right)}\right) $$

where $x\_{i}$ is the input to the $i^{th}$ block, $\mathcal{F}$ is the residual function (a stack of convolutional layers applied to the input), and $\mathcal{W}\_{i}^{\left(1\right)}$ and $\mathcal{W}\_{i}^{\left(2\right)}$ are the weights of the first and second residual branches, respectively.
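For concreteness, here is a minimal PyTorch sketch of such a two-branch residual block. The branch layout (a small conv–BN stack) is an illustrative assumption, not the paper's exact architecture:

```python
import torch.nn as nn

def residual_branch(channels):
    # One residual branch F(x, W): a small conv-BN stack (illustrative choice).
    return nn.Sequential(
        nn.Conv2d(channels, channels, kernel_size=3, padding=1),
        nn.BatchNorm2d(channels),
        nn.ReLU(),
        nn.Conv2d(channels, channels, kernel_size=3, padding=1),
        nn.BatchNorm2d(channels),
    )

class TwoBranchResidualBlock(nn.Module):
    # Standard two-branch block: x_{i+1} = x_i + F1(x_i) + F2(x_i)
    def __init__(self, channels):
        super().__init__()
        self.branch1 = residual_branch(channels)
        self.branch2 = residual_branch(channels)

    def forward(self, x):
        # Plain summation of the two parallel branches plus the skip connection.
        return x + self.branch1(x) + self.branch2(x)
```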

Shake-Shake regularization introduces a random variable $\alpha\_{i}$ following a uniform distribution between 0 and 1 during training:

$$x\_{i+1} = x\_{i} + \alpha\_{i}\mathcal{F}\left(x\_{i}, \mathcal{W}\_{i}^{\left(1\right)}\right) + \left(1-\alpha\_{i}\right)\mathcal{F}\left(x\_{i}, \mathcal{W}\_{i}^{\left(2\right)}\right) $$

A fresh $\alpha\_{i}$ is drawn for every forward pass during training (per mini-batch, or even per image). This randomness in blending the two branches forces the model to explore many different mixtures of paths rather than overfitting to any single one. At test time, all $\alpha\_{i}$ are set to their expected value of $0.5$, following the same logic as dropout.
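Building on the `TwoBranchResidualBlock` sketch above, here is a minimal PyTorch sketch of this training/test behavior. Note that it implements only the forward-pass rule described here; the complete Shake-Shake method also draws an independent coefficient for the backward pass, which is omitted in this sketch:

```python
import torch

class ShakeShakeBlock(TwoBranchResidualBlock):  # reuses the branches defined above
    def forward(self, x):
        if self.training:
            # Draw a fresh alpha ~ U(0, 1) for every forward pass
            # (here per call, i.e. per mini-batch).
            alpha = torch.rand(1, device=x.device)
        else:
            # At test time, use the expected value 0.5, as with dropout.
            alpha = torch.tensor(0.5, device=x.device)
        return x + alpha * self.branch1(x) + (1 - alpha) * self.branch2(x)

# Usage: stochastic blend in training mode, deterministic average in eval mode.
block = ShakeShakeBlock(channels=16)
x = torch.randn(8, 16, 32, 32)
block.train(); y_train = block(x)  # random alpha on each call
block.eval();  y_test = block(x)   # alpha fixed at 0.5
```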

Why is Shake-Shake Regularization Effective?

Shake-Shake regularization is effective at preventing overfitting because it injects controlled noise into the model. This noise comes from the stochastic affine combination of branches: every forward pass blends the two branches in a different random proportion, so the network effectively trains an implicit ensemble of paths rather than relying on any single one. A model regularized this way generalizes better and performs better on new, unseen data, which is the ultimate goal of any machine learning model.

Another reason Shake-Shake regularization is attractive is that it is not tied to one architecture. The stochastic affine combination can, in principle, be applied to any multi-branch network, not just the two-branch ResNet variant, since nothing about the random blending depends on a particular number of branches. One such generalization is sketched below.
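The paper itself evaluates the two-branch case, but the affine combination extends naturally to $n$ branches. The following sketch is a hypothetical generalization (an assumption here, not taken from the paper): draw $n$ uniform weights and normalize them to sum to 1, so the combination stays affine and each branch receives an expected weight of $1/n$:

```python
import torch

def shake_combine(branch_outputs, training=True):
    # Hypothetical n-branch generalization: random weights that sum to 1,
    # keeping the combination affine; expected weight per branch is 1/n.
    n = len(branch_outputs)
    if training:
        weights = torch.rand(n)
        weights = weights / weights.sum()
    else:
        weights = torch.full((n,), 1.0 / n)  # test-time expected value
    return sum(w * out for w, out in zip(weights, branch_outputs))
```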

Shake-Shake regularization helps deep multi-branch networks avoid overfitting to their training data. By introducing randomness into the combination of branches during training, the model learns to generalize and therefore performs better on new data. The technique is flexible, extends beyond two-branch ResNets, and has achieved strong results on image classification benchmarks, making it a valuable regularization tool for deep learning practitioners.
