Fast Attention Via Positive Orthogonal Random Features

Introduction:

FAVOR+, short for Fast Attention Via Positive Orthogonal Random Features, is the attention mechanism at the core of the Performer architecture. It approximates both the softmax and Gaussian kernels using kernel approximation with random features. With FAVOR+, queries and keys are mapped through a random feature map, so that attention can be computed without ever materializing the full attention matrix. The variance of the approximation is reduced by using positive random features and by entangling the random samples to be exactly orthogonal via the Gram-Schmidt orthogonalization procedure.

The FAVOR+ Mechanism:

The FAVOR+ mechanism is divided into three parts:

The FA-Part:

The first part of the FAVOR+ mechanism is the FA-part (fast attention). It starts from writing the attention matrix in the form A(i, j) = K(qi, kj), where qi and kj denote the ith query and jth key row-vector in Q and K. Here K is a kernel induced by a random feature map phi from d-dimensional space to r-dimensional positive real space. We can define the kernel as:

$$K(x, y) = E[\phi(x)^T\phi(y)]$$

Here, phi(u) is the random feature map applied to u in d-dimensional space, and r > 0 is the number of random features.

Defining Q' and K' as L x r matrices whose rows are phi(qi) and phi(kj) respectively, the attention mechanism can be approximated as:

$$\hat{Att_{\leftrightarrow}}(Q, K, V) = \hat{D}^{-1}(Q'((K')^{T}V))$$

Here, V is the value matrix, $\mathbb{1}_{L}$ is the all-ones vector of length L, and:

$$\hat{D} = \text{diag}(Q'((K')^{T}\mathbb{1}_{L}))$$

Evaluating the bracketed products $(K')^{T}V$ and $(K')^{T}\mathbb{1}_{L}$ first avoids ever forming the L x L attention matrix, reducing the cost from quadratic to linear in the sequence length L.
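The FA-part can be sketched in a few lines of NumPy. This is a minimal illustration, not the Performer implementation; the function and variable names are my own, and `Q_prime`/`K_prime` are assumed to already be feature-mapped.

```python
import numpy as np

def linear_attention(Q_prime, K_prime, V):
    """FA-part sketch: D^{-1} (Q' ((K')^T V)) without the L x L matrix.

    Q_prime, K_prime: (L, r) feature-mapped queries/keys (rows phi(q_i), phi(k_j)).
    V: (L, d) value matrix.
    """
    # (K')^T V has shape (r, d): O(L * r * d) work instead of O(L^2 * d)
    KV = K_prime.T @ V
    # Normalizer D = diag(Q' ((K')^T 1_L)); K'^T 1_L is just the column sums of K'
    D = Q_prime @ K_prime.sum(axis=0)  # shape (L,)
    return (Q_prime @ KV) / D[:, None]
```

Because the feature maps are positive, every entry of `D` is positive, so the normalization is numerically safe.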

The R-Part:

The second part of the FAVOR+ mechanism is the R-part (random features). The softmax kernel could be approximated with trigonometric random features, but these can take negative values and yield unstable, high-variance estimates. FAVOR+ instead uses positive random features (PRFs), which reduce the variance of the estimator and allow a smaller number of random features to be used.
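One standard positive-random-feature construction for the softmax kernel uses exponentials of Gaussian projections, so that the expected dot product of two feature vectors equals exp(x^T y). The sketch below assumes this construction; the function name is my own.

```python
import numpy as np

def positive_random_features(X, W):
    """Positive random features approximating the softmax kernel exp(x^T y).

    X: (L, d) input vectors; W: (m, d) projections drawn i.i.d. from N(0, I_d).
    phi(x) = exp(w^T x - ||x||^2 / 2) / sqrt(m), so every feature is positive
    and E[phi(x)^T phi(y)] = exp(x^T y).
    """
    m = W.shape[0]
    sq_norms = np.sum(X ** 2, axis=1, keepdims=True) / 2.0
    return np.exp(X @ W.T - sq_norms) / np.sqrt(m)
```

Feeding the resulting `Q_prime = positive_random_features(Q, W)` and `K_prime = positive_random_features(K, W)` into the FA-part yields a positive, unbiased approximation of softmax attention.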

The OR+ Part:

The final part of the FAVOR+ mechanism is the OR+ part (orthogonal random features). It entangles the random samples to be exactly orthogonal using the Gram-Schmidt orthogonalization procedure. This further reduces the variance of the estimator and improves the accuracy of the attention approximation for a given number of random features.

Together, these three parts of the FAVOR+ mechanism create an efficient attention mechanism that utilizes random features and kernel approximation to speed up the computation of softmax and Gaussian kernels.

Conclusion:

The FAVOR+ mechanism is an innovative method for creating efficient attention mechanisms in machine learning models such as the Performer architecture. By using positive random features and kernel approximation, FAVOR+ is able to speed up the computation of both softmax and Gaussian kernels. Additionally, the OR+ part of FAVOR+ helps to reduce the variance of the estimator, enabling the use of fewer random features and further improving the efficiency of the attention mechanism.
