Strided Attention

Strided Attention: Understanding Its Role in Sparse Transformers

Many machine learning models and architectures rely on the concept of attention, which allows a model to focus on specific parts of the input when making predictions. One widely used form is self-attention, which is common in natural language processing tasks. Strided attention is a variant of self-attention that has been proposed as part of the Sparse Transformer architecture. In this overview, we will delve into what strided attention is, how it works, and where it is best used.

Self-Attention and Sparse Transformers

Before discussing strided attention, it is crucial to understand what self-attention and Sparse Transformers are. Self-attention, as the name suggests, refers to the ability of a model to attend to its own input. It operates by creating a set of vectors from the input, which are transformed into queries, keys, and values. An attention score is then computed between each query and key, and these scores are used to weight the values and produce the output for each query. Self-attention is often used in language models because it allows the model to understand the context of a word within a sentence.
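In the standard scaled dot-product formulation, this computation can be written compactly as

$$\text{Attention}(Q, K, V) = \operatorname{softmax}\!\left(\frac{QK^\top}{\sqrt{d_k}}\right)V,$$

where $Q$, $K$, and $V$ are the matrices of queries, keys, and values, and $d_k$ is the dimensionality of the keys.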

Sparse Transformers, on the other hand, are a variation of the transformer architecture, which uses self-attention to process sequential input data, such as text. Transformers have been widely used in natural language processing tasks such as machine translation, text summarization, and sentiment analysis. The goal of Sparse Transformers is to make transformer-based models more efficient by reducing the computational cost of self-attention.
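To see where the savings come from, note that dense self-attention compares every query with every key, so its cost grows quadratically with the sequence length $n$. The factorized patterns used in Sparse Transformers, of which strided attention is one, reduce this to roughly

$$O(n^2) \;\longrightarrow\; O(n\sqrt{n}),$$

since each position only attends to on the order of $\sqrt{n}$ other positions per head.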

Strided Attention

Strided attention is a type of factorized attention pattern that has been proposed as part of Sparse Transformers. Factorized attention is a way of reducing the number of query-key comparisons that need to be computed, which can significantly reduce the computational cost of self-attention. Strided attention is a specific pattern that has one head attend to the previous $l$ locations, while the other head attends to every $l$th location. The value of $l$ is chosen to be close to $\sqrt{n}$, where $n$ is the number of input vectors.
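As a concrete illustration, the following Python sketch enumerates the two index sets for a single position in an autoregressive setting. The function name and the exact boundary handling (for instance, whether position $i$ itself is counted among the previous $l$ locations) are illustrative assumptions, not the reference implementation:

```python
def strided_attention_indices(i, l):
    """Sketch: positions attended to by the two strided heads at position i."""
    # Head 1: the previous l locations (up to and including i), clipped at 0.
    head_1 = list(range(max(0, i - l + 1), i + 1))
    # Head 2: every l-th location at or before i.
    head_2 = [j for j in range(i + 1) if (i - j) % l == 0]
    return head_1, head_2

# Example: position 10 with stride l = 4
print(strided_attention_indices(10, 4))
# -> ([7, 8, 9, 10], [2, 6, 10])
```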

In self-attention layers, the connectivity pattern is represented as a set of indices to which each output position attends. Strided attention therefore redefines this set of indices so that every element attends to either the previous $l$ locations or to every $l$th location. This pattern is visualized in the figure below:

Figure: Strided Attention Pattern

The output at each position is then computed as a weighted sum of (linear transformations of) the input vectors within this index set, with the weights given by the scaled dot-product similarity between the corresponding queries and keys.
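The NumPy sketch below puts these pieces together for a single head. The function name, the mask construction, and the single-head, unbatched setup are simplifying assumptions for illustration rather than the authors' implementation:

```python
import numpy as np

def strided_attention(Q, K, V, l, head="local"):
    """Sketch of masked scaled dot-product attention with a strided pattern.

    Q, K, V: arrays of shape (n, d). `head` selects which factorized head
    to emulate: "local" attends to the previous l positions, "strided"
    attends to every l-th position.
    """
    n, d = Q.shape
    scores = Q @ K.T / np.sqrt(d)              # (n, n) scaled dot products

    i = np.arange(n)[:, None]                  # query positions
    j = np.arange(n)[None, :]                  # key positions
    causal = j <= i                            # autoregressive constraint
    if head == "local":
        allowed = causal & (i - j < l)         # previous l locations
    else:
        allowed = causal & ((i - j) % l == 0)  # every l-th location

    scores = np.where(allowed, scores, -np.inf)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V                         # weighted sum of the values

# Tiny usage example with random inputs
n, d, l = 16, 8, 4
rng = np.random.default_rng(0)
Q, K, V = (rng.standard_normal((n, d)) for _ in range(3))
out = strided_attention(Q, K, V, l, head="strided")
print(out.shape)  # (16, 8)
```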

Strided attention is particularly useful for data that have a structure that aligns with the stride, like images or some types of music. However, the authors find that for data without a periodic structure, like text, the network can fail to route information properly with the strided pattern since spatial coordinates for an element do not necessarily correlate with the positions where the element may be most relevant in the future.

In summary, strided attention is a factorized attention pattern proposed as part of Sparse Transformers. It lets the model attend to the previous $l$ locations and to every $l$th location, with $l$ chosen close to $\sqrt{n}$, where $n$ is the number of input vectors. The pattern is particularly effective for data whose structure aligns with the stride, like images or some types of music, but it may not work well for data without a periodic structure, like text. By using strided attention, Sparse Transformers significantly reduce the computational cost of self-attention, making them more efficient for long sequences.
