Absolute Position Encodings

Absolute Position Encodings: Enhancing the Power of Transformer-based Models

For decades, natural language processing (NLP) models have struggled to approach human-level accuracy at understanding and manipulating natural language. In recent years, however, researchers have steadily improved the power of NLP models through better ways of representing their inputs, such as absolute position encodings, which inject word-order information into transformer-based models.

What are Absolute Position Encodings?

Absolute position encodings are a type of position embedding that enhances the performance of transformer-based models. They are added to the input embeddings at the bottom of the encoder and decoder stacks. These positional encodings have the same dimension as the embeddings so that the two can be summed. Each dimension of the positional encoding corresponds to a sinusoid, and the wavelengths form a geometric progression from $2\pi$ to $10000 \cdot 2\pi$.

These sinusoidal functions were chosen because the authors hypothesized that they would allow the model to learn to attend by relative positions easily. This is because, for any fixed offset $k$, $\text{PE}_{pos+k}$ can be represented as a linear function of $\text{PE}_{pos}$.
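As a concrete check of this claim, the following minimal NumPy sketch (illustrative only, with an arbitrary frequency $w$, position, and offset) verifies it for a single sine-cosine pair: shifting the position by $k$ rotates the pair by the fixed angle $k \cdot w$, a linear map that does not depend on $pos$.

```python
import numpy as np

# Toy check for one (sin, cos) pair with frequency w: the encoding of
# position pos + k equals a rotation of the encoding of position pos,
# and the rotation angle k * w depends only on the offset k.
w, pos, k = 0.1, 7.0, 5.0                                   # arbitrary values for illustration
pe_pos   = np.array([np.sin(pos * w), np.cos(pos * w)])
pe_shift = np.array([np.sin((pos + k) * w), np.cos((pos + k) * w)])
rot_k = np.array([[ np.cos(k * w), np.sin(k * w)],
                  [-np.sin(k * w), np.cos(k * w)]])
print(np.allclose(pe_shift, rot_k @ pe_pos))                # True
```

Stacking one such rotation per frequency yields a block-diagonal matrix that maps the full vector $\text{PE}_{pos}$ to $\text{PE}_{pos+k}$.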

Why are Absolute Position Encodings Important?

Absolute position encodings were developed to address a couple of significant limitations of the original transformer model:

Positional Information

On its own, the transformer model does not incorporate any positional information about the tokens in a sequence, which makes it poorly suited to sequential data like text. This is because self-attention operates on the set of token embeddings and is indifferent to the order in which they appear. Positional information therefore has to be injected explicitly, and absolute position encodings are one way of doing so.

Learning Long-term Dependencies

Without any notion of position, the model cannot tell how far apart two tokens are, which limits its ability to capture dependencies that span long stretches of text and, with them, the context and meaning of a passage. Absolute position encodings help address this limitation by giving the model a consistent signal of where each token sits in the sequence, so that relationships between distant tokens can be learned.

The Implementation of Absolute Position Encodings

The implementation of absolute position encodings is simple. Each position $pos$ of the input sequence is encoded as a vector of sinusoidal functions:

$$\begin{aligned} \text{PE}(pos, 2i) &= \sin\left(pos/10000^{2i/d_{model}}\right) \\ \text{PE}(pos, 2i+1) &= \cos\left(pos/10000^{2i/d_{model}}\right) \end{aligned}$$

In these equations, $i$ indexes the dimensions of the encoding (dimensions $2i$ and $2i+1$ form one sine-cosine pair), and $d_{model}$ is the dimensionality of the word embeddings. The resulting encoding matrix has size $L \times d_{model}$, where $L$ is the length of the sequence, so it can be added directly to the matrix of input embeddings.
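As a rough illustration, here is a minimal NumPy sketch (an assumption-laden example, not a reference implementation) that builds this table from the formulas above and adds it to a sequence of token embeddings:

```python
import numpy as np

def positional_encoding(seq_len, d_model):
    """Build the (seq_len, d_model) table of sinusoidal position encodings.
    Assumes d_model is even."""
    pos = np.arange(seq_len)[:, None]                # positions 0 .. seq_len - 1
    i = np.arange(d_model // 2)[None, :]             # index of each (sin, cos) pair
    angle = pos / (10000 ** (2 * i / d_model))       # pos / 10000^(2i / d_model)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angle)                      # even dimensions: sine
    pe[:, 1::2] = np.cos(angle)                      # odd dimensions: cosine
    return pe

# Hypothetical usage: random vectors stand in for learned token embeddings.
seq_len, d_model = 50, 512
token_embeddings = np.random.randn(seq_len, d_model)
x = token_embeddings + positional_encoding(seq_len, d_model)
print(x.shape)                                       # (50, 512)
```

Because the table depends only on the position index and the model width, it can be precomputed once and reused for any input of the same or shorter length.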

The use of sine and cosine functions with geometrically spaced frequencies was a deliberate choice: it gives the model an easy route to attending by relative position. Because the encoding of position $pos+k$ is a fixed linear transformation of the encoding of position $pos$, comparing the sine and cosine values of two positions tells the model how close together or far apart they are.
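To see this concretely, the sketch below (the same table construction as above, with arbitrarily chosen sizes) compares an anchor position against progressively more distant positions via the dot product of their encodings.

```python
import numpy as np

# Same table construction as above, with arbitrarily chosen sizes.
seq_len, d_model = 256, 64
pos = np.arange(seq_len)[:, None]
i = np.arange(d_model // 2)[None, :]
angle = pos / (10000 ** (2 * i / d_model))
pe = np.zeros((seq_len, d_model))
pe[:, 0::2], pe[:, 1::2] = np.sin(angle), np.cos(angle)

# Dot product between an anchor position and progressively more distant ones:
# the value depends only on the offset, and nearby positions tend to score
# higher than far-away ones.
anchor = 100
for offset in (1, 4, 16, 64):
    print(offset, round(float(pe[anchor] @ pe[anchor + offset]), 2))
```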

Advantages of Absolute Position Encodings

The primary advantage of using absolute position encodings is improved model performance. By including positional information in the embeddings, the model can attend to the order in which tokens appear in the sequence, which improves accuracy on tasks that depend on understanding the meaning and context of the text.

Absolute position encodings can also help with model efficiency. The sinusoidal table is fixed rather than learned, so it adds no parameters and can be precomputed for any sequence length, while still letting the model relate distant tokens in a sequence, something traditional sequential NLP models often struggle with. This makes the model more efficient at processing and analyzing sequential data such as natural language text.

Absolute position encodings are a crucial development in the field of NLP. By providing a way to incorporate positional information into transformer-based models, they improve the accuracy and efficiency of these models, making them better at understanding the context and meaning of natural language text.

As NLP continues to evolve, it is likely that we will see more algorithmic breakthroughs like this one. Ultimately, these advances could lead to a future where machines can understand natural language as well as or even better than humans, opening up new possibilities for communication, knowledge management, and much more.
