Sliding Window Attention

Sliding Window Attention is an attention pattern that improves the efficiency of attention-based models such as the Transformer architecture. It restricts each token's attention to a fixed-size window of surrounding tokens, which reduces the time and memory complexity of full (non-sparse) self-attention. This pattern is especially useful for long input sequences, where full attention becomes inefficient. By stacking multiple layers of windowed attention, the model still obtains a large receptive field.

Motivation for Sliding Window Attention

The original Transformer formulation has a self-attention component that becomes inefficient for long input sequences, because its time and memory complexity is $O\left(n^{2}\right)$, where $n$ is the input sequence length. At the same time, local context remains highly informative for these long sequences. Sliding Window Attention therefore places a fixed-size window around each token so that it attends only to a smaller set of nearby contextual tokens.

How Sliding Window Attention Works

Each token in the sequence attends to $\frac{w}{2}$ tokens on each side, where $w$ is the size of the fixed window. The computational complexity of this pattern is $O\left(n \times w\right)$, which scales linearly with the input sequence length $n$. This makes Sliding Window Attention much more efficient than full attention for long input sequences.
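
To make the mechanism concrete, here is a minimal sketch of windowed attention using an explicit banded mask (the function name, tensor shapes, and window value are illustrative choices, not part of the original formulation). For clarity it materializes the full $n \times n$ score matrix and then masks it, so it demonstrates the attention pattern rather than the $O\left(n \times w\right)$ memory savings an efficient implementation would achieve.

```python
import torch
import torch.nn.functional as F

def sliding_window_attention(q, k, v, w):
    """q, k, v: (batch, seq_len, d); w: total window size."""
    n, d = q.shape[-2], q.shape[-1]
    scores = q @ k.transpose(-2, -1) / d**0.5             # (batch, n, n)

    # Band mask: token i may attend to token j only when |i - j| <= w / 2.
    idx = torch.arange(n)
    band = (idx[None, :] - idx[:, None]).abs() <= w // 2  # (n, n) boolean
    scores = scores.masked_fill(~band, float("-inf"))

    weights = F.softmax(scores, dim=-1)                   # each row sums to 1
    return weights @ v                                     # (batch, n, d)

# Example: one sequence of 16 tokens, model dimension 8, window of 4 (2 tokens per side).
q = k = v = torch.randn(1, 16, 8)
out = sliding_window_attention(q, k, v, w=4)
print(out.shape)  # torch.Size([1, 16, 8])
```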

Using Stacked Layers of Sliding Window Attention

To build representations that incorporate information from across the entire input, multiple stacked layers of Sliding Window Attention can be used. The receptive field size depends on the number of layers and the window size: for a Transformer with $l$ layers, the receptive field is $l \times w$ (assuming a fixed $w$ for all layers). Depending on the application, a different $w$ can also be used at each layer to balance efficiency against model representation capacity.
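
As a quick illustration of how the receptive field grows (the layer counts and window sizes below are arbitrary example values), it can be computed by summing the per-layer window sizes, which reduces to $l \times w$ when every layer uses the same $w$:

```python
def receptive_field(window_sizes):
    """Approximate top-layer receptive field of a stack of sliding-window layers."""
    return sum(window_sizes)

# Fixed window: 12 layers with w = 512 gives l * w = 6144 tokens.
print(receptive_field([512] * 12))            # 6144

# Per-layer windows: smaller windows in lower layers, larger ones higher up.
print(receptive_field([64, 128, 256, 512]))   # 960
```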

Overall, Sliding Window Attention is a powerful tool for improving the efficiency and effectiveness of attention-based models. It allows these models to handle long input sequences more effectively while still incorporating local context into their representations. By stacking layers of Sliding Window Attention, models can build a large receptive field and learn from the entire input sequence.
