Attention Free Transformer

Attention Free Transformer (AFT) is an efficient variant of the multi-head attention module that does away with dot-product self-attention. Instead, AFT combines the keys and values with a set of learned position biases, and then multiplies the result with the query in an element-wise fashion. The resulting operation has a memory complexity that is linear in both the context size and the feature dimension, making it compatible with large inputs and large model sizes.

How AFT Works

Given an input $X$, AFT first linearly transforms it into a query ($Q = XW_Q$), a key ($K = XW_K$) and a value ($V = XW_V$). It then performs the following operation:

$$ Y=f(X);\text { } Y_{t}=\sigma_{q}\left(Q_{t}\right) \odot \frac{\sum_{t^{\prime}=1}^{T} \exp \left(K_{t^{\prime}}+w_{t, t^{\prime}}\right) \odot V_{t^{\prime}}}{\sum_{t^{\prime}=1}^{T} \exp \left(K_{t^{\prime}}+w_{t, t^{\prime}}\right)} $$

This equation operates as follows: for each target position $t$, AFT performs a weighted average of the values, and the result is combined with the query through element-wise multiplication. The weights are derived from the keys and a set of learned pairwise position biases $w_{t, t'}$. This provides the immediate advantage of never having to compute and store the expensive attention matrix, while still capturing global interactions between the query and the values, as MHA does.
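As a concrete illustration, here is a minimal PyTorch sketch of the operation above for a single sequence. The function name, the tensor shapes, and the rearrangement of the sums into two matrix products are illustrative choices, not code from the paper:

```python
import torch

def aft_full(Q, K, V, w):
    """Minimal sketch of the AFT operation defined in the equation above.

    Q, K, V: (T, d) query, key and value for a single sequence.
    w:       (T, T) learned pairwise position biases w_{t, t'}.
    """
    # The sum over t' of exp(K_t' + w_{t,t'}) * V_t' factors into two
    # matrix products, so no (T, T, d) tensor is ever materialized.
    exp_K = torch.exp(K)                      # (T, d)
    exp_w = torch.exp(w)                      # (T, T)
    numerator = exp_w @ (exp_K * V)           # (T, d)
    denominator = exp_w @ exp_K               # (T, d)
    # Gate the weighted average with the sigmoid-activated query
    return torch.sigmoid(Q) * numerator / denominator

# Example: T = 8 positions, d = 16 features
T, d = 8, 16
Q, K, V = (torch.randn(T, d) for _ in range(3))
w = torch.zeros(T, T)                         # position biases (learned in practice)
Y = aft_full(Q, K, V, w)                      # (T, d)
```

In practice the exponentials would also be stabilized, for example by subtracting the maxima of the keys and biases before calling `torch.exp`.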

Advantages of AFT

There are several advantages to using AFT over dot-product attention. First, it is computationally efficient: it speeds up training and inference while also saving memory. As discussed, this comes from eliminating the need to compute and store the costly attention matrix, which makes it feasible to apply attention to longer input sequences and larger models than were previously practical.

Second, AFT is also simpler to implement. Unlike traditional attention modules, there is no need to compute pairwise dot products or self-attention scores at each step. Instead, AFT requires only a set of learned position biases to incorporate contextual information across positions. This also makes the model easier to analyze and interpret.

Lastly, AFT has a number of practical applications. In areas such as natural language processing (NLP), where efficiency and performance are crucial, AFT has the potential to significantly speed up data processing and learning. Additionally, as models continue to grow larger and more complex, AFT could be an essential component for keeping their memory and compute requirements manageable.

AFT vs MHA

While AFT and Multi-Head Attention (MHA) are both attention modules, they differ in their approach. MHA learns the relationships between the different parts of the input sequence through pairwise dot products between queries and keys. AFT, on the other hand, replaces these dot products with a set of learned position biases, which, together with the keys, weight the values before the result is gated by the query. Additionally, while MHA struggles with long input sequences because its attention matrix grows quadratically with sequence length, AFT's linear memory complexity makes it much better suited to long inputs.
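To make the memory difference concrete, here is a rough back-of-the-envelope comparison; the sequence length, feature dimension, and head count below are arbitrary illustrative values, not numbers from the paper:

```python
# Per-example memory in float32, ignoring everything both models share.
T, d, heads, bytes_per_float = 8192, 512, 8, 4

# MHA materializes a T x T attention map per head for every example.
mha_attention_maps = heads * T * T * bytes_per_float
# AFT stores no per-example attention map; its Q, K, V activations are (T, d),
# and the T x T position biases are shared parameters, not per-example state.
aft_qkv_activations = 3 * T * d * bytes_per_float

print(f"MHA attention maps:    {mha_attention_maps / 2**20:.0f} MiB per example")
print(f"AFT Q/K/V activations: {aft_qkv_activations / 2**20:.0f} MiB per example")
```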

AFT Implementation

Implementing AFT in a neural network framework is relatively simple. First, we linearly transform the input into Query (Q), Key (K) and Value (V) matrices, as described earlier, and add a learned pairwise position-bias matrix to the keys. We then compute an element-wise product of the sigmoid-activated query matrix and a weighted average of the values, where the weights come from the exponentiated keys and position biases. The output of this product is added back to the original input through a residual connection, and the result is passed through a feed-forward network. This block can be used for both supervised and unsupervised learning tasks.
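The sketch below shows how these steps could fit together as a single PyTorch block. The class name, layer sizes, residual placement, and feed-forward expansion factor are illustrative assumptions rather than a prescription from the paper:

```python
import torch
import torch.nn as nn

class AFTBlock(nn.Module):
    """Sketch of a Transformer-style block with AFT replacing self-attention."""

    def __init__(self, dim: int, max_len: int, hidden_mult: int = 4):
        super().__init__()
        self.to_q = nn.Linear(dim, dim)
        self.to_k = nn.Linear(dim, dim)
        self.to_v = nn.Linear(dim, dim)
        # Learned pairwise position biases w_{t, t'}
        self.pos_bias = nn.Parameter(torch.zeros(max_len, max_len))
        self.ffn = nn.Sequential(
            nn.Linear(dim, hidden_mult * dim),
            nn.GELU(),
            nn.Linear(hidden_mult * dim, dim),
        )

    def forward(self, x):                               # x: (batch, T, dim)
        T = x.size(1)
        Q, K, V = self.to_q(x), self.to_k(x), self.to_v(x)
        exp_K = torch.exp(K)
        exp_w = torch.exp(self.pos_bias[:T, :T])
        # Weighted average of the values, then element-wise gating by sigmoid(Q)
        num = torch.einsum("ts,bsd->btd", exp_w, exp_K * V)
        den = torch.einsum("ts,bsd->btd", exp_w, exp_K)
        y = torch.sigmoid(Q) * num / den
        x = x + y                                       # residual connection
        return x + self.ffn(x)                          # residual over the FFN

block = AFTBlock(dim=64, max_len=128)
out = block(torch.randn(2, 16, 64))                     # -> (2, 16, 64)
```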

AFT is an efficient and adaptable attention mechanism that removes the quadratic time and memory costs of traditional self-attention. It has the potential to significantly speed up learning in NLP and other applications where massive amounts of data are processed. Additionally, its simplicity makes it more interpretable than traditional attention mechanisms, which is useful for researchers and developers. The AFT model is relatively easy to implement and can be used for both supervised and unsupervised learning in neural network architectures. In summary, AFT is a promising building block that could reshape how attention is used in machine learning.
