Global and Sliding Window Attention

Overview of Global and Sliding Window Attention

Global and Sliding Window Attention is an attention pattern used in attention-based models to improve efficiency on long input sequences. It modifies the original Transformer architecture, whose dense (non-sparse) self-attention has time and memory complexity of O(n^2) in the sequence length, making it difficult to scale to longer inputs. Global and Sliding Window Attention addresses this by restricting most tokens to a local sliding window and adding global attention at a few pre-selected input locations.

The Challenge with Non-Sparse Attention

The self-attention component of the original Transformer lets every token attend to every other token in the input sequence. As a result, its time and memory cost grows as O(n^2), where n is the input sequence length. This quadratic scaling makes the model difficult to apply to long inputs, such as full documents, in practice.
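
To make the cost concrete, here is a minimal NumPy sketch of dense self-attention (function and variable names are illustrative, not from any particular library). The n x n score matrix it materialises is the source of the O(n^2) time and memory cost.

```python
# Minimal sketch of dense (non-sparse) self-attention in NumPy.
# The n x n score matrix is what drives the O(n^2) cost.
import numpy as np

def dense_self_attention(x, wq, wk, wv):
    """x: (n, d) token embeddings; wq/wk/wv: (d, d) projection matrices (illustrative)."""
    q, k, v = x @ wq, x @ wk, x @ wv
    scores = q @ k.T / np.sqrt(x.shape[1])        # shape (n, n): every token scores every other token
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v                            # (n, d) contextualised outputs

n, d = 4096, 64
rng = np.random.default_rng(0)
x = rng.standard_normal((n, d))
wq, wk, wv = (rng.standard_normal((d, d)) for _ in range(3))
out = dense_self_attention(x, wq, wk, wv)         # materialises a 4096 x 4096 score matrix
```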

The Solution: Global and Sliding Window Attention

Global and Sliding Window Attention improves on the original Transformer by sparsifying the attention pattern for long input sequences. Most tokens attend only within a local sliding window, while global attention is added at a few pre-selected locations that matter for the task. This reduces the cost of the attention component so that it scales linearly with sequence length, without sacrificing accuracy.

How Global and Sliding Window Attention Works

The Global and Sliding Window Attention pattern combines two types of attention: global attention and sliding window attention. Global attention gives selected tokens full attention over the entire input, while sliding window attention restricts each token to a fixed-size window around its position. The global attention is symmetric: each token with global attention attends to all tokens in the sequence, and all tokens in the sequence attend to it.

Global attention is applied to a few tokens at task-specific, pre-selected locations chosen to capture the most relevant information in the sequence. Sliding window attention is applied over a fixed-size window of neighbouring tokens to capture shorter-range context; the window slides over the input sequence, each time attending within that window.
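
The combined pattern can be pictured as a boolean attention mask. The sketch below builds such a mask from scratch; the window size and global positions are arbitrary examples, not values prescribed by any specific model.

```python
# Illustrative sketch of a combined attention mask: a sliding window of width `window`
# around each token, plus symmetric global attention at a few chosen positions.
import numpy as np

def build_attention_mask(n, window, global_positions):
    mask = np.zeros((n, n), dtype=bool)
    half = window // 2
    for i in range(n):
        lo, hi = max(0, i - half), min(n, i + half + 1)
        mask[i, lo:hi] = True                     # local sliding-window attention
    for g in global_positions:
        mask[g, :] = True                         # global token attends to every token
        mask[:, g] = True                         # every token attends to the global token (symmetric)
    return mask

mask = build_attention_mask(n=16, window=4, global_positions=[0])  # e.g. position 0 as a [CLS]-style token
print(mask.sum(), "allowed attention pairs out of", mask.size)
```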

Benefits of Global and Sliding Window Attention

Global and Sliding Window Attention has several benefits. First, it decreases the time and memory complexity of the attention component from quadratic to linear in the sequence length, making it practical to scale to longer input sequences. Second, it improves performance while requiring only a small number of global tokens. Third, it yields task-specific representations that are more effective than purely windowed or dilated attention patterns. Fourth, it is flexible and can be used for a wide range of applications, such as classification and question answering.
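
A back-of-the-envelope comparison makes the first benefit concrete. The numbers below are illustrative (sequence length, window size, and global-token count are arbitrary), and the count ignores small overlaps between the window and the global rows and columns.

```python
# Rough comparison: dense attention vs. sliding-window attention plus a few global tokens.
n, w, g = 4096, 512, 2                  # sequence length, window size, number of global tokens
dense_pairs = n * n                     # O(n^2) score entries
sparse_pairs = n * w + 2 * g * n        # O(n * w) window entries plus symmetric global rows/columns
print(dense_pairs, sparse_pairs)        # 16777216 vs 2113536 -- roughly an 8x reduction
```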

Applications of Global and Sliding Window Attention

The Global and Sliding Window Attention pattern can be applied to a wide range of natural language processing tasks. It has been successfully used in the Longformer model for classification and question answering tasks. In classification tasks, global attention is used for the CLS token, while in question answering tasks, global attention is provided on all question tokens. The pattern can also be used for sentiment analysis, text summarization, and language modelling.
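
As a hedged example of the classification setup, the sketch below uses the Hugging Face transformers implementation of Longformer and its global_attention_mask argument, placing global attention on the first ([CLS]-style) token. For question answering, the same mask would instead be set to 1 at every question-token position. The model checkpoint name and input text are illustrative.

```python
# Global attention on the first token, as in a classification setup
# (assumes the Hugging Face `transformers` Longformer implementation).
import torch
from transformers import LongformerModel, LongformerTokenizer

tokenizer = LongformerTokenizer.from_pretrained("allenai/longformer-base-4096")
model = LongformerModel.from_pretrained("allenai/longformer-base-4096")

text = "A long document to classify."
inputs = tokenizer(text, return_tensors="pt")

global_attention_mask = torch.zeros_like(inputs["input_ids"])
global_attention_mask[:, 0] = 1          # 1 = global attention; here only the first token is global

outputs = model(**inputs, global_attention_mask=global_attention_mask)
```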

Global and Sliding Window Attention is an important pattern for attention-based models. It improves the efficiency of the attention component when dealing with long input sequences, without sacrificing accuracy. It provides task-specific representations that are more effective than windowed or dilated attention patterns. The pattern can be applied to a wide range of natural language processing tasks, making it an important tool for researchers and practitioners alike.
