Spatial-Reduction Attention

What is Spatial-Reduction Attention?

Spatial-Reduction Attention (SRA) is a multi-head attention variant used in the Pyramid Vision Transformer (PVT) architecture. It reduces the spatial scale of the key and value before the attention operation is applied, which lowers the computational and memory cost of the attention layer.

How Does SRA Work?

The SRA in stage i can be formulated as follows:

$$\text{SRA}(Q, K, V) = \text{Concat}\left(\text{head}_{0}, \ldots, \text{head}_{N_{i}}\right) W^{O}$$

At its core, SRA applies a spatial-reduction operation to the key and value before the attention operation: the key and value sequences are shrunk along the spatial dimension, while the query keeps its full resolution.

Each head of SRA has its own set of linear projection parameters $W_{j}^{Q} \in \mathbb{R}^{C_{i} \times d_{\text{head}}}$, $W_{j}^{K} \in \mathbb{R}^{C_{i} \times d_{\text{head}}}$, and $W_{j}^{V} \in \mathbb{R}^{C_{i} \times d_{\text{head}}}$, while $W^{O} \in \mathbb{R}^{C_{i} \times C_{i}}$ is the output projection applied to the concatenated heads.

Here’s a more detailed breakdown of the formula:

$$\text{head}_{j} = \text{Attention}\left(Q W_{j}^{Q}, \text{SR}(K) W_{j}^{K}, \text{SR}(V) W_{j}^{V}\right)$$

where $\text{Concat}(\cdot)$ is the concatenation operation, $N_{i}$ is the number of heads in the attention layer of Stage $i$, and $d_{\text{head}}$ is equal to $\frac{C_{i}}{N_{i}}$. $\text{SR}(\cdot)$ is the operation that reduces the spatial dimension of the input sequence ($K$ or $V$).
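
As a rough illustration, a minimal PyTorch sketch of the head computation and concatenation is given below. The function and argument names (`sra_heads`, `w_q`, `sr`, and so on) are illustrative rather than taken from any reference implementation, and the spatial-reduction function `sr` is treated as a black box here; it corresponds to the $\text{SR}(\cdot)$ operation defined in the next step.

```python
import torch

def sra_heads(q, k, v, w_q, w_k, w_v, w_o, sr):
    """Sketch of SRA: per-head attention over spatially reduced K and V.

    q, k, v       : (H_i*W_i, C_i) input sequences for a single image.
    w_q, w_k, w_v : lists of per-head projections, each of shape (C_i, d_head).
    w_o           : output projection of shape (C_i, C_i).
    sr            : callable implementing SR(.), returning a
                    (H_i*W_i / R_i^2, C_i) sequence.
    """
    k_red, v_red = sr(k), sr(v)                 # reduced key/value sequences
    heads = []
    for w_q_j, w_k_j, w_v_j in zip(w_q, w_k, w_v):
        q_j = q @ w_q_j                         # (H_i*W_i, d_head)
        k_j = k_red @ w_k_j                     # (H_i*W_i / R_i^2, d_head)
        v_j = v_red @ w_v_j                     # (H_i*W_i / R_i^2, d_head)
        scale = q_j.shape[-1] ** 0.5            # sqrt(d_head)
        attn = torch.softmax(q_j @ k_j.T / scale, dim=-1)
        heads.append(attn @ v_j)                # (H_i*W_i, d_head)
    return torch.cat(heads, dim=-1) @ w_o       # (H_i*W_i, C_i)
```

The only difference from ordinary multi-head attention is the `sr(k)` and `sr(v)` calls; the queries are left at full resolution.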

The spatial-reduction operation $\text{SR}(\cdot)$ is defined as:

$$\text{SR}(\mathbf{x}) = \text{Norm}\left(\text{Reshape}\left(\mathbf{x}, R_{i}\right) W^{S}\right)$$

Here, $\mathbf{x} \in \mathbb{R}^{\left(H_{i} W_{i}\right) \times C_{i}}$ represents an input sequence, and $R_{i}$ denotes the reduction ratio of the attention layers in Stage $i$. $\text{Reshape}\left(\mathbf{x}, R_{i}\right)$ reshapes the input sequence $\mathbf{x}$ to a sequence of size $\frac{H_{i} W_{i}}{R_{i}^{2}} \times\left(R_{i}^{2} C_{i}\right)$. $W^{S} \in \mathbb{R}^{\left(R_{i}^{2} C_{i}\right) \times C_{i}}$ is a linear projection that reduces the dimension of the input sequence to $C_{i}$, and $\text{Norm}(\cdot)$ refers to layer normalization.
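
Following these definitions, a minimal sketch of $\text{SR}(\cdot)$ might look like the following (again with illustrative names such as `spatial_reduce`): the reshape groups each $R_i \times R_i$ window of the feature map into a single token of dimension $R_i^2 C_i$, which is then projected back to $C_i$ channels and layer-normalized.

```python
import torch
import torch.nn.functional as F

def spatial_reduce(x, h_i, w_i, r_i, w_s, norm_weight=None, norm_bias=None):
    """SR(x) = Norm(Reshape(x, R_i) W^S).

    x   : (H_i*W_i, C_i) input sequence.
    w_s : (R_i^2 * C_i, C_i) projection mapping each window back to C_i channels.
    """
    c_i = x.shape[-1]
    # Reshape(x, R_i): view the sequence as an H_i x W_i feature map, cut it into
    # R_i x R_i windows, and flatten each window into one R_i^2 * C_i vector.
    x = x.view(h_i, w_i, c_i)
    x = x.view(h_i // r_i, r_i, w_i // r_i, r_i, c_i)
    x = x.permute(0, 2, 1, 3, 4).reshape(-1, r_i * r_i * c_i)
    x = x @ w_s                                   # (H_i*W_i / R_i^2, C_i)
    return F.layer_norm(x, (c_i,), norm_weight, norm_bias)
```

In practice, implementations often realize this reshape-plus-projection as a strided convolution with kernel size and stride $R_i$, which computes the same thing; the version above follows the formula literally.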

What are the Benefits of SRA?

SRA is designed to reduce the computational and memory overhead required for multi-head attention. This makes it particularly useful when working with deep learning models which have a large number of attention layers.

Because SR shortens the key and value sequences by a factor of $R_{i}^{2}$, the attention matrix, and with it the computation and memory consumed by the attention operation, shrinks by roughly the same factor. This makes attention over large, high-resolution feature maps affordable.
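
As a concrete, purely illustrative calculation (the feature-map size and reduction ratio below are assumed, not taken from the text): with a $56 \times 56$ feature map and $R_i = 8$, the key/value sequence shrinks from 3136 tokens to 49, so the attention matrix has 64 times fewer entries.

```python
h_i = w_i = 56               # assumed feature-map size for this example
r_i = 8                      # assumed reduction ratio for this example

n_q = h_i * w_i              # query length: 3136
n_kv = n_q // (r_i * r_i)    # key/value length after SR: 49

print(n_q * n_q)             # attention entries without SR: 9834496
print(n_q * n_kv)            # attention entries with SR: 153664
```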

Another benefit is that this efficiency pays off in model quality: because attention over high-resolution feature maps becomes affordable, the Pyramid Vision Transformer can keep fine-grained, multi-scale feature maps throughout the network, which benefits dense prediction tasks such as object detection and semantic segmentation.

Spatial-Reduction Attention is a simple but effective modification of multi-head attention that reduces the memory and computational overhead of the attention layer by shrinking the spatial dimension of the key and value before attention. It is particularly useful in architectures with many attention layers operating on long, image-derived token sequences, and it is a key ingredient that allows the Pyramid Vision Transformer to serve as an efficient backbone for both image classification and dense prediction.
