Fixed Factorized Attention

Fixed Factorized Attention: A More Efficient Attention Pattern

Neural networks for natural language processing have to handle long sequences of input data. One way to make this tractable is to use an attention mechanism that focuses on only certain parts of the input at each step. Fixed factorized attention is an attention pattern that does exactly that.

Self-Attention

A self-attention layer is a foundational component of many neural networks that work with natural language. The layer maps a matrix of input embeddings to an output matrix and is parameterized by a connectivity pattern that specifies which input positions each output position may attend to. Each output vector is a weighted sum of transformations of the input vectors it attends to. Full self-attention for autoregressive models lets every element attend to all previous positions as well as its own position.
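As a concrete illustration, here is a minimal NumPy sketch of a single attention head parameterized by a boolean connectivity mask; the full autoregressive case simply uses the lower-triangular (causal) mask. The function and variable names are illustrative and not taken from the Sparse Transformer code.

```python
import numpy as np

def masked_attention(X, Wq, Wk, Wv, allowed):
    """One attention head whose connectivity pattern is given by `allowed`.

    X:       (n, d) matrix of input embeddings.
    allowed: (n, n) boolean matrix; allowed[i, j] is True iff output
             position i may attend to input position j. Every row is
             assumed to contain at least one True entry.
    """
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    scores = (Q @ K.T) / np.sqrt(K.shape[-1])       # attention logits
    scores = np.where(allowed, scores, -np.inf)     # forbid disallowed positions
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)  # softmax over allowed j only
    return weights @ V                              # weighted sum of value vectors

# Full autoregressive self-attention: position i attends to every j <= i.
rng = np.random.default_rng(0)
n, d = 16, 4
X = rng.normal(size=(n, d))
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))
causal = np.tril(np.ones((n, n), dtype=bool))
out = masked_attention(X, Wq, Wk, Wv, causal)       # shape (n, d)
```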

Factorized self-attention, on the other hand, uses multiple attention heads, each of which attends only to a subset of the preceding positions rather than all of them. The goal of the Sparse Transformer architecture was to find efficient choices for these subsets and thereby improve the efficiency of the network.
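Using notation close to that of the Sparse Transformer paper, the m-th of p factorized heads attends only to an index set that is a subset of the causal positions, and the efficiency goal is for each set to grow like the p-th root of the sequence length n:

$$
\text{head}_m(x_i) = \operatorname{Attend}\bigl(x_i,\ A_i^{(m)}\bigr),
\qquad A_i^{(m)} \subseteq \{\, j : j \le i \,\},
\qquad \bigl|A_i^{(m)}\bigr| \propto \sqrt[p]{n},
\qquad m = 1, \dots, p .
$$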

Fixed Factorized Attention

Fixed factorized attention is a factorized attention pattern in which specific cells summarize previous locations and propagate that information to all future cells. Each position attends to specific sub-blocks of the input rather than to the entire input, which improves efficiency and reduces computational cost.

Formally, in fixed factorized attention, cells attend to sub-blocks of length c within a larger block of size l: one head attends within the current block (all positions j in the same length-l block as i, with j ≤ i), while the other attends to the last c "summary" positions of every block (all j with j mod l ≥ l − c, again with j ≤ i). For example, if the stride is 128 and c = 8, then all future positions greater than 128 can attend to positions 120-128, and so forth.
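Below is a minimal sketch of how the two fixed-attention masks could be constructed with NumPy and plugged into the masked_attention helper above; the small sizes (n = 16, l = 8, c = 2) are chosen only to keep the masks easy to inspect, and merging the two masks into one head is just one of the integration options the paper describes.

```python
def fixed_attention_masks(n, l, c):
    """Boolean connectivity masks for the two fixed factorized heads.

    A1[i, j]: j lies in the same length-l block as i (and j <= i).
    A2[i, j]: j is one of the last c "summary" positions of its block,
              i.e. j % l >= l - c (and j <= i), so those cells are
              visible to all future positions.
    """
    idx = np.arange(n)
    causal = idx[None, :] <= idx[:, None]                   # j <= i
    same_block = (idx[None, :] // l) == (idx[:, None] // l)
    summary = (idx % l) >= (l - c)                          # last c columns of each block
    return causal & same_block, causal & summary[None, :]

# Toy usage: 16 positions, block length l = 8, summary width c = 2.
A1, A2 = fixed_attention_masks(n=16, l=8, c=2)
# Merge the two patterns into a single head so every query has at least one
# allowed key (the paper also describes assigning the patterns to separate heads).
out = masked_attention(X, Wq, Wk, Wv, A1 | A2)
```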

Computational Cost

A fixed-attention pattern with a very small c (such as c = 1) limits the expressivity of the network significantly, as many representations in the network are only used within one block while a small number of locations are used by all blocks. The authors found that for typical values of l in {128, 256}, choosing c in {8, 16, 32} performs well, although this increases the computational cost of the method by a factor of c in comparison to the strided attention pattern.
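To make the factor-of-c claim concrete, the short count below compares how many positions the final query can reach through the summary head of the fixed pattern versus through the strided pattern's column head (the positions j with (i − j) mod l = 0); the counts assume the mask definitions used in the sketch above.

```python
# How many keys does the final query attend to via the second head?
n, l, c = 1024, 128, 8
i = n - 1
fixed_summary  = sum(1 for j in range(i + 1) if j % l >= l - c)    # fixed pattern, width c
strided_column = sum(1 for j in range(i + 1) if (i - j) % l == 0)  # strided pattern
print(fixed_summary, strided_column)  # 64 vs 8: roughly c times more attended positions
```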

Multiple Heads

The authors found that having multiple attention heads attend to distinct sub-blocks of length c within the block of size l was preferable to having them all attend to the same sub-block. Spreading the heads across different sub-blocks makes better use of the summary positions and gives the model more flexibility in its attention patterns.
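The paper does not pin down exactly how the sub-blocks are assigned to heads, so the helper below is only a hypothetical illustration: head h takes its own length-c slice at the end of each stride-l block, so different heads summarize different columns.

```python
def head_subblock_mask(n, l, c, head):
    """Hypothetical per-head summary mask: head `head` attends to the c
    columns ending at offset l - head * c within each block (head 0 gets
    the last c columns, head 1 the c columns before those, and so on),
    restricted to causal positions j <= i."""
    lo, hi = l - (head + 1) * c, l - head * c
    idx = np.arange(n)
    causal = idx[None, :] <= idx[:, None]
    in_slice = (idx % l >= lo) & (idx % l < hi)
    return causal & in_slice[None, :]

# Two heads with l = 8, c = 2 watch columns {6, 7} and {4, 5} of every block;
# in practice each would be combined with a within-block component so that
# every query still has at least one allowed key.
mask_h0 = head_subblock_mask(n=16, l=8, c=2, head=0)
mask_h1 = head_subblock_mask(n=16, l=8, c=2, head=1)
```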

Fixed factorized attention is an efficient attention mechanism that reduces the computational cost of neural networks for natural language processing. By attending to sub-blocks of the input rather than to every previous position, the network can process long sequences more efficiently. For typical values of l, choosing c in {8, 16, 32} works well, and having multiple attention heads attend to distinct sub-blocks of length c makes the pattern more effective still.
