Global Sub-Sampled Attention

Overview of Global Sub-Sampled Attention (GSA)

Global Sub-Sampled Attention, or GSA, is a global attention mechanism used in the Twins-SVT architecture. It summarizes the key information of each sub-window with a single representative and uses those representatives to communicate between sub-windows, with the representatives serving as the keys in self-attention. This design is intended to reduce the computational cost of self-attention.

Local Attention Mechanisms

Before diving into GSA, it's important to understand what an attention mechanism is. An attention mechanism is a way for neural networks to focus on specific areas of input data that are important for making predictions. In the case of image processing, this means looking at certain regions of an image to understand what objects or features are present.
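
As a reference point, here is a minimal sketch of plain (global) scaled dot-product attention over a flattened feature map. The function and variable names are my own for this illustration and do not come from the Twins-SVT code:

```python
import torch

def scaled_dot_product_attention(q, k, v):
    """Plain attention: every query position scores every key position."""
    d = q.size(-1)
    scores = q @ k.transpose(-2, -1) / d ** 0.5   # (batch, n_queries, n_keys)
    weights = scores.softmax(dim=-1)              # attention weights per query
    return weights @ v                            # (batch, n_queries, d)

# A 56x56 feature map with 64 channels, flattened into 3136 tokens.
x = torch.randn(1, 56 * 56, 64)
out = scaled_dot_product_attention(x, x, x)
print(out.shape)  # torch.Size([1, 3136, 64])
```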

Local attention mechanisms limit the amount of computation that self-attention requires in vision transformers. Instead of letting every position attend to every other position in the feature map, local attention divides the feature map into sub-windows and computes attention only within each sub-window. GSA is the complementary step: it restores communication between those sub-windows without paying the full cost of global attention.
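
To make sub-windows concrete, the sketch below splits a feature map into non-overlapping 7x7 windows and runs attention independently inside each one; this is the within-window counterpart that GSA pairs with. The helper names are hypothetical, not taken from any released implementation:

```python
import torch

def window_partition(x, k):
    """Reshape (batch, H, W, dim) into (batch * num_windows, k*k, dim)."""
    b, h, w, d = x.shape
    x = x.view(b, h // k, k, w // k, k, d)
    return x.permute(0, 1, 3, 2, 4, 5).reshape(-1, k * k, d)

def attention(q, k_, v):
    scores = q @ k_.transpose(-2, -1) / q.size(-1) ** 0.5
    return scores.softmax(dim=-1) @ v

# A 56x56 feature map split into 7x7 sub-windows: 64 windows of 49 tokens.
x = torch.randn(1, 56, 56, 64)
windows = window_partition(x, k=7)                # (64, 49, 64)
local_out = attention(windows, windows, windows)  # attention inside each window
print(windows.shape, local_out.shape)
```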

GSA in Twins-SVT

In the Twins-SVT architecture, GSA supplies the global interaction between the sub-windows of a feature map. The key information of each sub-window is summarized into a single representative by sub-sampling that window, and the representatives then serve as the keys (and values) in self-attention, so every position can exchange information with every sub-window.
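
Below is a minimal, single-head sketch of this idea, assuming the sub-sampling is done with a simple average pool over each sub-window (the actual implementation may use a strided convolution or another summarizer); all module and variable names are invented for the illustration:

```python
import torch
import torch.nn as nn

class GlobalSubSampledAttention(nn.Module):
    """Sketch of GSA: full-resolution queries attend to one summarized
    key/value per k x k sub-window of the feature map."""

    def __init__(self, dim, window=7):
        super().__init__()
        self.to_q = nn.Linear(dim, dim)
        self.to_kv = nn.Linear(dim, 2 * dim)
        # One representative per sub-window; a strided conv is another option.
        self.subsample = nn.AvgPool2d(kernel_size=window, stride=window)
        self.proj = nn.Linear(dim, dim)

    def forward(self, x):
        # x: (batch, H, W, dim)
        b, h, w, d = x.shape
        q = self.to_q(x).reshape(b, h * w, d)            # queries from every position
        summary = self.subsample(x.permute(0, 3, 1, 2))  # (b, d, H/k, W/k)
        summary = summary.flatten(2).transpose(1, 2)     # (b, num_windows, d)
        k, v = self.to_kv(summary).chunk(2, dim=-1)      # sub-sampled keys/values
        attn = (q @ k.transpose(-2, -1) / d ** 0.5).softmax(dim=-1)
        out = attn @ v                                   # (b, H*W, d)
        return self.proj(out).reshape(b, h, w, d)

gsa = GlobalSubSampledAttention(dim=64, window=7)
y = gsa(torch.randn(1, 56, 56, 64))
print(y.shape)  # torch.Size([1, 56, 56, 64])
```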

Using the sub-sampled feature maps as the keys (and values) in the attention operation reduces its computational cost: each query now attends to a small set of summarized keys rather than to every position in the feature map, which makes training and inference faster and more efficient.
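
As a back-of-envelope illustration of the saving, the arithmetic below counts query-key pairs for a 56x56 feature map with 7x7 sub-windows (token counts only, not measured FLOPs of any real model):

```python
H, W, k = 56, 56, 7            # feature-map size and sub-window size
tokens = H * W                 # 3136 query positions
full_pairs = tokens * tokens   # full attention: 9,834,496 query-key pairs
gsa_keys = (H // k) * (W // k) # one representative per sub-window: 64 keys
gsa_pairs = tokens * gsa_keys  # GSA: 200,704 pairs, a k*k = 49x reduction
print(full_pairs, gsa_pairs, full_pairs // gsa_pairs)
```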

Computing the Optimal Window Size

The efficiency of GSA, together with the local attention it complements, depends on choosing the right window size, i.e. the size of the sub-windows over which attention is grouped and summarized. The optimal window size is determined by the resolution of the feature maps being processed.

For simplicity, a fixed window size of 7x7 is used in all stages of the Twins-SVT architecture, although a theoretically optimal window size can be computed for each stage from the resolution of its feature maps.
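
A sketch of the cost argument behind this, paraphrasing the analysis in the Twins-SVT paper in asymptotic terms (treat the exact form as my summary rather than the paper's verbatim result):

```latex
% Approximate cost of one local-attention + GSA pair on an H x W feature
% map of dimension d with k_1 x k_2 sub-windows (requires amsmath):
\[
  \mathcal{O}\Bigl(
    \underbrace{k_1 k_2 \, H W d}_{\text{attention inside each sub-window}}
    \;+\;
    \underbrace{\tfrac{(HW)^2 d}{k_1 k_2}}_{\text{GSA over } HW/(k_1 k_2) \text{ summarized keys}}
  \Bigr)
\]
% Treating k_1 k_2 as a single variable and setting the derivative of the
% sum to zero balances the two terms and gives the optimum:
\[
  k_1 k_2 \, H W d = \frac{(HW)^2 d}{k_1 k_2}
  \quad\Longrightarrow\quad
  k_1 k_2 = \sqrt{HW},
\]
% so for a square map (H = W), square windows of side \sqrt{H} are optimal,
% e.g. about 7.5 for a 56 x 56 first-stage feature map, which makes the
% fixed 7 x 7 choice a reasonable simplification.
```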

In the later stages, where the feature-map resolution is lower, the summarizing (sub-sampling) window size is reduced so that the number of generated keys does not become too small. Specifically, summarizing window sizes of 4x4, 2x2 and 1x1 are used for the last three stages, respectively.
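
Written out as a purely hypothetical configuration sketch, taking only the window values from the description above and leaving the first stage's summarizing window unspecified:

```python
# Hypothetical per-stage settings for a Twins-SVT-style model. The 7x7
# attention sub-window and the 4/2/1 summarizing windows come from the
# description above; the first stage's summarizing window is not stated
# there, so it is left as None rather than guessed.
attention_window = {"stage1": 7, "stage2": 7, "stage3": 7, "stage4": 7}
summarizing_window = {"stage1": None, "stage2": 4, "stage3": 2, "stage4": 1}
```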

Conclusion

Global Sub-Sampled Attention is an effective attention mechanism used in the Twins-SVT architecture to reduce the computational cost of self-attention in vision transformers. By using sub-sampled feature maps as the attention keys, GSA preserves global communication between sub-windows while making training and inference faster and more efficient. The optimal window size depends on the resolution of the feature maps at each stage and can, in principle, be computed per stage.
