ALiBi, or Attention with Linear Biases, is a method that replaces position embeddings in Transformer models and allows them to extrapolate to longer sequences at inference time. Instead of adding position information to the input embeddings, ALiBi biases the query-key attention scores in each head with a penalty that grows linearly with the distance between the query and the key. The slope of that penalty is head-specific and is fixed before training rather than learned. The rest of the computation remains unchanged. The following provides more information about this method.

The Transformer model is widely known for its ability to perform well in natural language processing (NLP) tasks such as language translation and sentiment analysis. Position embeddings are traditionally used in the Transformer architecture to provide the model with information about the order of the input sequence. However, position embeddings tie the model to the sequence lengths seen during training, and performance typically degrades when the model is evaluated on longer inputs. ALiBi is an alternative to position embeddings that keeps the attention computation simple while allowing the model to handle longer sequences without sacrificing accuracy.

What is ALiBi?

ALiBi stands for Attention with Linear Biases. It is a modification of the attention sublayer within the Transformer model. When computing the attention scores for each head, a bias proportional to the distance between the query and key positions is added to each score, penalizing distant keys. The strength of this penalty is controlled by a head-specific slope, $m$, which is set to a constant before training rather than learned. Because the bias depends only on relative position, the model no longer needs position embeddings and their associated constraints on input length.
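Concretely, the ALiBi paper sets the slopes as a geometric sequence: for 8 heads they are $\frac{1}{2}, \frac{1}{4}, \ldots, \frac{1}{256}$, and in general the sequence starts at $2^{-8/n}$ for $n$ heads and keeps that ratio between steps. Below is a minimal sketch of that schedule; the function name is illustrative, and it assumes the number of heads is a power of two, the simplest case described in the paper.

```python
def get_alibi_slopes(num_heads):
    """Head-specific slopes m as a geometric sequence.

    For num_heads = 8 this gives 1/2, 1/4, ..., 1/256; in general the
    sequence starts at 2**(-8/num_heads) and keeps that ratio between heads.
    Assumes num_heads is a power of two.
    """
    start = 2 ** (-8 / num_heads)
    return [start ** (i + 1) for i in range(num_heads)]

print(get_alibi_slopes(8))  # [0.5, 0.25, 0.125, ..., 0.00390625]
```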

How Does ALiBi Work?

The calculation of attention scores in the Transformer model involves the dot product of two vectors, $\textbf{q}\_{i}$ and $\textbf{k}\_{j}$, where $i$ and $j$ are positions in the input sequence. In the unmodified attention sublayer, these dot products are followed by the application of the softmax function. ALiBi modifies this process by adding a bias term to each dot product before the softmax is applied. The bias for the query at position $i$ and the key at position $j$ is $-m \cdot (i - j)$, where $m$ is the head-specific slope set to a constant value rather than learned during training; keys farther from the query receive a larger penalty. The new formula for the attention scores in ALiBi is as follows:

Attention$(\textbf{q}_{i}, \textbf{K}, \textbf{V}) = \text{softmax}(\textbf{q}_{i}\textbf{K}^{\top} + m \cdot [-(i-1), \ldots, -2, -1, 0])\,\textbf{V}$

In this formula, $\textbf{q}\_{i}$ is the query at position $i$, $\textbf{K}$ stacks the keys for the first $i$ positions (in a causal, left-to-right model), and $\textbf{V}$ stacks the corresponding values; $m$ is the head-specific slope. The vector $[-(i-1), \ldots, -2, -1, 0]$ penalizes each key in proportion to its distance from the query, so nearby tokens are favored over distant ones. Because this bias is a fixed function of position, no position embeddings are needed, and the resulting model matches the accuracy of standard position embeddings at the training length while extrapolating to longer sequences at inference time.
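Putting this together, here is a hedged sketch of a single ALiBi attention head in NumPy. The helper names are illustrative, and the inclusion of the usual $1/\sqrt{d\_k}$ scaling of the dot products is an assumption of this sketch rather than part of the formula above.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def alibi_attention(q, k, v, m):
    """Single-head causal attention with an ALiBi linear bias of slope m.

    q, k, v: arrays of shape (seq_len, d_k).
    """
    seq_len, d_k = q.shape
    scores = q @ k.T / np.sqrt(d_k)          # scaled dot-product scores
    i = np.arange(seq_len)[:, None]          # query positions
    j = np.arange(seq_len)[None, :]          # key positions
    scores = scores - m * (i - j)            # linear distance penalty
    scores[j > i] = -np.inf                  # causal mask: ignore future keys
    return softmax(scores, axis=-1) @ v

# Toy usage with random inputs
rng = np.random.default_rng(0)
q, k, v = (rng.normal(size=(5, 8)) for _ in range(3))
out = alibi_attention(q, k, v, m=0.25)
print(out.shape)  # (5, 8)
```

For positions $j \le i$ the added term is $-m(i - j)$, which matches the vector $[-(i-1), \ldots, -1, 0]$ in the formula above.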

Why Use ALiBi Instead of Position Embeddings?

Traditional position embeddings have limitations in certain NLP applications. Sinusoidal and learned position embeddings are tied to the positions observed during training, so a model trained on short sequences usually degrades when it is evaluated on longer ones. Learned position embeddings also introduce extra parameters that must be optimized and fix a maximum input length in advance. ALiBi, on the other hand, adds no learned parameters: the head-specific slope is set before training, and the linear bias can be computed on the fly for any sequence length at inference time. This makes ALiBi an attractive option for NLP applications where position embeddings may not be ideal.
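To make the parameter-count point concrete, here is a small, hedged comparison with illustrative sizes: a learned position-embedding table adds a trainable matrix of shape max_len × d_model and fixes the maximum input length in advance, while the ALiBi bias is built from fixed slopes for whatever length arrives at inference.

```python
import numpy as np

d_model, max_len_train = 512, 1024

# Learned position embeddings: a trainable table of shape (max_len, d_model).
# It adds max_len * d_model parameters and cannot index positions beyond max_len.
pos_table = np.zeros((max_len_train, d_model))
print("extra parameters with learned embeddings:", pos_table.size)  # 524288

# ALiBi: the per-head slope m is a fixed constant, so nothing new is learned,
# and the linear bias can be built for any length, e.g. 4096 > 1024.
m, seq_len = 0.25, 4096
i = np.arange(seq_len)[:, None]
j = np.arange(seq_len)[None, :]
bias = -m * np.maximum(i - j, 0)  # distance penalty used by a causal model
print("extra parameters with ALiBi:", 0, "| bias shape:", bias.shape)
```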

ALiBi is a method for computing attention scores in the Transformer model that provides a simple alternative to position embeddings. By adding a linear, distance-dependent bias to the attention scores, the model no longer needs position embeddings, and because the bias is fixed rather than learned, no extra parameters are introduced. Models trained with ALiBi match the accuracy of models trained with standard position embeddings at the training length and, unlike them, continue to perform well when evaluated on longer sequences. ALiBi is a practical tool for natural language processing applications that need Transformer models to be both efficient and robust to inputs beyond their training context.
