LayerScale is a method used in the development of vision transformer architectures, introduced with the CaiT ("Going deeper with Image Transformers") architecture. It improves the training dynamics of deeper image transformers by adding a learnable diagonal matrix after each residual block. This simple layer makes it possible to train the high-capacity image transformers that benefit from depth.

What is LayerScale?

LayerScale is a per-channel multiplication applied to the vector output of each residual block in the transformer architecture. Concretely, it multiplies the block's output by a small learnable diagonal matrix whose entries are initialized close to (but not at) zero. Because the matrix is diagonal, every channel of the output gets its own learnable scale, rather than the whole block being adjusted by a single scalar.
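The per-channel multiplication can be made concrete with a minimal sketch. The function and variable names below (`layer_scale`, `lam`, `eps`) are illustrative, not from any particular library; the key point is that multiplying by a diagonal matrix reduces to an element-wise scale per channel.

```python
import numpy as np

def layer_scale(residual_output, lam):
    """Per-channel scaling of a residual branch's output.

    residual_output: array of shape (tokens, d), the output of a residual block
    lam: learnable vector of shape (d,), the diagonal of the scaling matrix
    """
    # Multiplying by diag(lam) is just an element-wise scale of each channel.
    return residual_output * lam

d = 4
eps = 1e-1  # small initial value: the diagonal starts close to, but not at, zero
lam = np.full(d, eps)

tokens = np.ones((3, d))
scaled = layer_scale(tokens, lam)
print(scaled[0])  # every channel of every token is scaled by 0.1
```

In a real model `lam` would be a trainable parameter, one per residual block, updated by gradient descent along with the rest of the network.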

Why is LayerScale Important?

LayerScale is important because it improves the training dynamics of deep, high-capacity image transformers, letting them actually benefit from their depth. Without it, very deep image transformers often struggle to converge, making it difficult to train accurate and reliable models.

LayerScale is particularly beneficial for vision transformers, which rely on depth to develop effective visual representations. By stabilizing training at greater depths, it yields more accurate and detailed representations, which matters for tasks like image classification, object detection, and image segmentation.

How Does LayerScale Work?

LayerScale works by adding a learnable diagonal matrix after each residual block, applied as a per-channel multiplication of the block's vector output. With $\eta$ denoting layer normalization, $\mathrm{SA}$ the self-attention branch, and $\mathrm{FFN}$ the feed-forward branch, layer $l$ computes

$$x\_{l}^{\prime} = x\_{l} + \mathrm{diag}(\lambda\_{l,1}, \ldots, \lambda\_{l,d}) \times \mathrm{SA}(\eta(x\_{l}))$$

$$x\_{l+1} = x\_{l}^{\prime} + \mathrm{diag}(\lambda\_{l,1}^{\prime}, \ldots, \lambda\_{l,d}^{\prime}) \times \mathrm{FFN}(\eta(x\_{l}^{\prime}))$$

The diagonal values are initialized with a small value so that the initial contribution of each residual branch is small and every block starts close to the identity function. This allows the network to integrate the additional parameters progressively during training and to train effectively even at large depths.

The parameters $\lambda\_{l,i}$ and $\lambda\_{l,i}^{\prime}$ in the formulas above are learnable weights that allow for diversity in the optimization of the network. They are all initialized to the same fixed small value $\varepsilon$, chosen according to the depth of the network: $\varepsilon = 0.1$ up to depth 18, $\varepsilon = 10^{-5}$ up to depth 24, and $\varepsilon = 10^{-6}$ for deeper networks.
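One transformer block with LayerScale on both residual branches can be sketched as follows. This is a simplified illustration, not the original implementation: `sa` and `ffn` below are placeholder stand-ins for real self-attention and feed-forward branches, and `layer_norm` is a minimal version of $\eta$ without affine parameters.

```python
import numpy as np

def layer_norm(x):
    # Minimal per-token layer normalization (eta in the formulas), no affine terms.
    mu = x.mean(axis=-1, keepdims=True)
    sigma = x.std(axis=-1, keepdims=True)
    return (x - mu) / (sigma + 1e-6)

def block_with_layerscale(x, sa, ffn, lam_sa, lam_ffn):
    """One pre-norm transformer block with LayerScale on both residual branches."""
    x = x + lam_sa * sa(layer_norm(x))    # x'_l = x_l + diag(lambda_l) SA(eta(x_l))
    x = x + lam_ffn * ffn(layer_norm(x))  # x_{l+1} = x'_l + diag(lambda'_l) FFN(eta(x'_l))
    return x

d = 8
eps = 1e-5  # the initialization used for networks up to depth 24
lam_sa = np.full(d, eps)
lam_ffn = np.full(d, eps)

# Placeholder branches: a real block would use attention and an MLP here.
sa = lambda h: h @ np.eye(d)
ffn = lambda h: np.tanh(h)

x = np.random.default_rng(0).normal(size=(3, d))
out = block_with_layerscale(x, sa, ffn, lam_sa, lam_ffn)
# With tiny lambdas, the block is close to the identity at initialization.
print(np.abs(out - x).max())
```

The printout confirms the point of the small initialization: at the start of training each block barely perturbs its input, and the residual branches grow into the function progressively as the $\lambda$ values are learned.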

LayerScale vs. Other Normalization Strategies

LayerScale is often compared to normalization strategies like ActNorm and LayerNorm, but it seeks a different effect. ActNorm and LayerNorm normalize the output of each layer, whereas LayerScale is applied only to the output of each residual block.

ActNorm is a data-dependent initialization that calibrates activations so that they have zero mean and unit variance, much like BatchNorm. In contrast, LayerScale initializes its diagonal matrix with small values so that the initial contribution of each residual branch is small, letting the network fold the extra parameters in gradually as training proceeds.

LayerScale also differs from approaches like ReZero, SkipInit, Fixup, and T-Fixup, which are initialization and gating schemes rather than normalizations. Those methods adjust each residual branch with a single learnable scalar, while LayerScale multiplies the output by a diagonal matrix, i.e., one learnable weight per channel. This gives the optimizer more flexibility in how it scales each dimension of the residual branch.

LayerScale is a simple but effective method for building deep vision transformer architectures. By adding a learnable diagonal matrix after each residual block, applied as a per-channel multiplication of the block's output, it lets the network integrate new parameters progressively during training and stabilizes optimization at depth. That extra per-channel flexibility makes it a valuable tool for anyone training deep image transformers.
