Spatially Separable Self-Attention: A Method to Reduce Complexity in Vision Transformers
As computer vision tasks grow more demanding and call for higher-resolution inputs, the cost of global self-attention in vision transformers rises quadratically with the number of tokens. Spatially Separable Self-Attention (SSSA) is an attention module used in the Twins-SVT architecture that aims to reduce this computational complexity for dense prediction tasks.
SSSA is composed of locally-grouped self-attention (LSA) and global sub-sampled attention (GSA). LSA is a self-attention mechanism that operates within non-overlapping sub-windows of the feature map: each token attends only to the tokens in its own sub-window, which keeps every attention computation small. GSA, in contrast, operates globally across the entire input. To keep its cost manageable, each query in GSA attends only to a small set of representative keys, one generated by a sub-sampling function from each sub-window.
Formally, SSSA can be written as:
$$\hat{\mathbf{z}}\_{ij}^{l}=\operatorname{LSA}\left(\operatorname{LayerNorm}\left(\mathbf{z}\_{ij}^{l-1}\right)\right)+\mathbf{z}\_{ij}^{l-1}$$
$$\mathbf{z}\_{ij}^{l}=\operatorname{FFN}\left(\operatorname{LayerNorm}\left(\hat{\mathbf{z}}\_{ij}^{l}\right)\right)+\hat{\mathbf{z}}\_{ij}^{l}$$
$$\hat{\mathbf{z}}^{l+1}=\operatorname{GSA}\left(\operatorname{LayerNorm}\left(\mathbf{z}^{l}\right)\right)+\mathbf{z}^{l}$$
$$\mathbf{z}^{l+1}=\operatorname{FFN}\left(\operatorname{LayerNorm}\left(\hat{\mathbf{z}}^{l+1}\right)\right)+\hat{\mathbf{z}}^{l+1}$$
$$i \in\{1,2,\ldots,m\}, \quad j \in\{1,2,\ldots,n\}$$
where $\mathbf{z}\_{ij}^{l}$ denotes the tokens of the $(i, j)$-th sub-window at layer $l$, and the feature map is partitioned into $m \times n$ sub-windows.
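To make the structure concrete, below is a minimal PyTorch sketch of one SSSA block that follows the equations above. It is not the official Twins-SVT implementation: the class names (`LocallyGroupedAttention`, `GlobalSubsampledAttention`, `SSSABlock`), the use of `nn.MultiheadAttention`, and the strided convolution chosen as the sub-sampling function are illustrative assumptions.

```python
import torch
import torch.nn as nn


class LocallyGroupedAttention(nn.Module):
    """LSA: multi-head self-attention restricted to non-overlapping sub-windows."""

    def __init__(self, dim, num_heads=8, window_size=7):
        super().__init__()
        self.ws = window_size
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, x, H, W):
        B, N, C = x.shape
        ws = self.ws
        # Partition the H x W token grid into (H/ws) * (W/ws) sub-windows.
        x = x.reshape(B, H // ws, ws, W // ws, ws, C)
        windows = x.permute(0, 1, 3, 2, 4, 5).reshape(-1, ws * ws, C)
        # Self-attention runs independently inside each sub-window.
        out, _ = self.attn(windows, windows, windows)
        # Merge the sub-windows back into the full token sequence.
        out = out.reshape(B, H // ws, W // ws, ws, ws, C)
        return out.permute(0, 1, 3, 2, 4, 5).reshape(B, N, C)


class GlobalSubsampledAttention(nn.Module):
    """GSA: every token attends to one representative key/value per sub-window.
    A strided convolution is used as the sub-sampling function (one plausible choice)."""

    def __init__(self, dim, num_heads=8, window_size=7):
        super().__init__()
        self.sub_sample = nn.Conv2d(dim, dim, kernel_size=window_size, stride=window_size)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, x, H, W):
        B, N, C = x.shape
        # Sub-sample the token grid to one representative per sub-window.
        grid = x.transpose(1, 2).reshape(B, C, H, W)
        kv = self.sub_sample(grid).flatten(2).transpose(1, 2)  # B x (N / ws^2) x C
        # Queries are all tokens; keys/values are the sub-window representatives.
        out, _ = self.attn(x, kv, kv)
        return out


class SSSABlock(nn.Module):
    """LSA followed by GSA, each with the pre-LayerNorm residual + FFN structure above."""

    def __init__(self, dim, num_heads=8, window_size=7, mlp_ratio=4):
        super().__init__()
        self.norms = nn.ModuleList([nn.LayerNorm(dim) for _ in range(4)])
        self.lsa = LocallyGroupedAttention(dim, num_heads, window_size)
        self.gsa = GlobalSubsampledAttention(dim, num_heads, window_size)
        self.ffn1 = nn.Sequential(nn.Linear(dim, dim * mlp_ratio), nn.GELU(),
                                  nn.Linear(dim * mlp_ratio, dim))
        self.ffn2 = nn.Sequential(nn.Linear(dim, dim * mlp_ratio), nn.GELU(),
                                  nn.Linear(dim * mlp_ratio, dim))

    def forward(self, x, H, W):
        x = x + self.lsa(self.norms[0](x), H, W)  # z_hat^l     = LSA(LN(z^{l-1})) + z^{l-1}
        x = x + self.ffn1(self.norms[1](x))       # z^l         = FFN(LN(z_hat^l)) + z_hat^l
        x = x + self.gsa(self.norms[2](x), H, W)  # z_hat^{l+1} = GSA(LN(z^l)) + z^l
        x = x + self.ffn2(self.norms[3](x))       # z^{l+1}     = FFN(LN(z_hat^{l+1})) + z_hat^{l+1}
        return x


# Example: a 56 x 56 token grid with 64-dim tokens and 7 x 7 sub-windows.
tokens = torch.randn(2, 56 * 56, 64)
block = SSSABlock(dim=64, num_heads=8, window_size=7)
print(block(tokens, H=56, W=56).shape)  # torch.Size([2, 3136, 64])
```

In this sketch, no attention matrix ever reaches the full $HW \times HW$ size of standard global self-attention: LSA attends within small windows, and GSA attends from all tokens to only the sub-sampled representatives, which is where the computational savings come from.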
SSSA has been shown to achieve strong performance on various dense prediction tasks while maintaining computational efficiency, making it a promising technique for computer vision applications that require high-resolution inputs.