Locally-Grouped Self-Attention

A Computation-Friendly Attention Mechanism: Locally-Grouped Self-Attention

Locally-Grouped Self-Attention (LSA) is an attention mechanism used in the Twins-SVT architecture. Its purpose is to reduce the computational cost of self-attention in neural networks.

How LSA Works

LSA divides the 2D feature maps of the input into m x n equally sized sub-windows and applies self-attention only within each sub-window, so every token attends solely to the other tokens in its own group (see the sketch below).
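
Below is a minimal sketch of this grouping in PyTorch: the feature map is partitioned into non-overlapping sub-windows and plain scaled dot-product attention is run independently inside each one. It is an illustration only, assuming H and W are divisible by the sub-window sizes k1 and k2, and it omits the learned query/key/value projections, multiple heads, and residual connections of the actual Twins-SVT block; the function and variable names are ours, not the paper's.

```python
# Minimal locally-grouped self-attention sketch (single head, no learned
# projections). Assumes H % k1 == 0 and W % k2 == 0. Illustrative only.
import torch

def lsa(x, k1, k2):
    """x: feature map of shape (B, H, W, d). Returns a tensor of the same shape."""
    B, H, W, d = x.shape
    m, n = H // k1, W // k2                      # number of sub-windows per axis

    # Partition the feature map into m * n non-overlapping sub-windows,
    # each containing k1 * k2 tokens.
    x = x.reshape(B, m, k1, n, k2, d)
    x = x.permute(0, 1, 3, 2, 4, 5)              # (B, m, n, k1, k2, d)
    windows = x.reshape(B * m * n, k1 * k2, d)   # one row of tokens per group

    # Scaled dot-product self-attention, restricted to each group.
    # For simplicity the queries, keys and values are the window tokens themselves.
    attn = torch.softmax(windows @ windows.transpose(1, 2) / d ** 0.5, dim=-1)
    out = attn @ windows                         # (B*m*n, k1*k2, d)

    # Undo the window partition to recover the (B, H, W, d) layout.
    out = out.reshape(B, m, n, k1, k2, d).permute(0, 1, 3, 2, 4, 5)
    return out.reshape(B, H, W, d)

# Example: a 56x56 feature map with 64 channels, split into 7x7 sub-windows.
y = lsa(torch.randn(2, 56, 56, 64), k1=7, k2=7)
print(y.shape)  # torch.Size([2, 56, 56, 64])
```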

This design is analogous to the multi-head design in self-attention, where communication happens only among the channels of the same head. Because each group contains only a small fraction of the tokens, the cost of self-attention within a group is far lower than that of full self-attention over the whole feature map.

Specifically, if we let k1 = H/m and k2 = W/n, each of the HW/(k1k2) sub-windows costs O((k1k2)^2 d), so the total cost of LSA is O(k1k2HWd), where H and W are the height and width of the feature maps and d is the feature dimension. This is far cheaper than full self-attention when k1 and k2 are much smaller than H and W, and it grows only linearly with HW when k1 and k2 are fixed.

Although LSA is a computation-friendly attention mechanism, it allows no communication between different sub-windows: information is processed purely locally within each group. This restricts the receptive field and, as the experiments show, significantly degrades performance. A mechanism for exchanging information across sub-windows is therefore needed, much as depth-wise convolutions alone cannot replace all standard convolutions in CNNs.

Efficiency of LSA

The efficiency of LSA comes from shrinking the number of token pairs over which attention is computed. In standard self-attention, the computation cost is O(H^2W^2d), where H and W are the height and width of the feature maps and d is the feature dimension. This quadratic growth in the number of tokens quickly becomes prohibitive for large feature maps, making both training and inference expensive.

In contrast, LSA reduces this cost to O(k1k2HWd), which is much lower whenever k1 and k2 are much smaller than H and W, making it a more practical attention mechanism for high-resolution inputs. A concrete comparison is shown below.
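
As a rough, back-of-the-envelope illustration (the numbers are ours, not taken from the Twins-SVT paper), consider a 56 x 56 feature map with d = 64 channels and 7 x 7 sub-windows:

```python
# Cost comparison for a 56x56 feature map with d = 64 and 7x7 sub-windows.
# Illustrative numbers only.
H, W, d = 56, 56, 64
k1, k2 = 7, 7

full_cost = (H * W) ** 2 * d    # O(H^2 W^2 d): every token attends to every token
lsa_cost = k1 * k2 * H * W * d  # O(k1 k2 H W d): attention restricted to sub-windows

print(f"full self-attention: {full_cost:,}")             # 629,407,744
print(f"locally-grouped    : {lsa_cost:,}")              # 9,834,496
print(f"speed-up factor    : {full_cost // lsa_cost}x")  # 64x
```

With k1 and k2 fixed, the gap widens further as the resolution grows, since the full-attention cost scales quadratically with HW while the LSA cost scales only linearly.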

Conclusion

Locally-Grouped Self-Attention is a computation-friendly attention mechanism used in neural networks. It reduces the cost of attention by dividing the feature maps of an input image into smaller sub-windows and applying self-attention only within each sub-window, bringing the complexity down from quadratic to linear in the number of tokens. Because this confines information flow to each sub-window, an additional mechanism for communication across sub-windows is needed to avoid a limited receptive field and degraded performance.
