Adaptive Masking

Adaptive Masking is a type of attention mechanism used in machine learning that allows a model to learn its own context size to attend over. This is done by adding a masking function to each head in Multi-Head Attention to control the span of that head's attention.

What is a Masking Function?

A masking function is a non-increasing function that maps a distance to a value in [0, 1]. It is applied to the attention weights so that the model pays more attention to important nearby information and ignores irrelevant distant information. The adaptive masking technique uses a soft masking function, a piecewise-linear function inspired by a study conducted by Jernite et al. in 2017.

How Does Adaptive Masking Work?

Adaptive masking uses a soft masking function m_z, parametrized by a real value z in [0, S], where S is the maximum span the model can attend over. A hyper-parameter R controls the softness of the mask, i.e. the width of its linear ramp:

m_z(x) = min( max( (R + z − x) / R, 0 ), 1 )

Distances x up to z are fully attended (mask value 1), the mask decays linearly to 0 over the next R positions, and anything beyond z + R is fully masked out.
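To make the shape concrete, here is a minimal NumPy sketch of the piecewise-linear soft mask, assuming the standard formulation m_z(x) = min(max((R + z − x)/R, 0), 1); the particular values of z and R below are illustrative, not learned:

```python
import numpy as np

def soft_mask(x, z, R):
    """Soft masking function m_z(x) = min(max((R + z - x) / R, 0), 1).

    Distances x <= z get weight 1, the weight decays linearly to 0
    over the next R positions, and distances beyond z + R are fully
    masked out.
    """
    return np.clip((R + z - x) / R, 0.0, 1.0)

# Example: a head with learned span z = 4 and softness R = 2
distances = np.arange(8)  # distances 0..7 from the current token
print(soft_mask(distances, z=4.0, R=2.0))
# distances 0..4 -> 1.0, distance 5 -> 0.5, distances >= 6 -> 0.0
```

Because the ramp is linear in z, the mask is differentiable with respect to z almost everywhere, which is what lets each head learn its own span by gradient descent.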

The attention weights from the masked span are then computed using the following formula:

a_tr = m_z(t − r) · exp(s_tr) / Σ_{q = t−S}^{t−1} m_z(t − q) · exp(s_tq)

Here, m_z is the soft masking function parametrized by z, and s_tr is the raw attention score (similarity) between the current position t and a position r within its span. The mask scales each score's exponential before normalization, so fully masked positions receive zero attention weight.

Regularization with Adaptive Masking

Regularization is a technique used in machine learning to reduce overfitting, which occurs when a model performs well on the training data but poorly on unseen data. In Adaptive Masking, an L1 penalization on the parameters z_i for each attention head i is added to the loss function, which pushes each head toward the smallest span it actually needs.

The formula for the loss function is:

L = −log P(w_1, …, w_T) + (λ / M) Σ_i z_i

Here, P(w_1, …, w_T) is the model's probability of the sequence w_1, …, w_T, λ is the regularization hyper-parameter, and M is the number of attention heads in each layer. This formulation is differentiable in the parameters z_i, which means they are learned jointly with the rest of the model.
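The span-penalty term can be sketched in a few lines; the function name, the λ value, and the per-head span values below are hypothetical, chosen only to illustrate the (λ/M) Σ_i z_i formula:

```python
import numpy as np

def span_penalty(z_per_head, lam):
    """L1 span penalty (lambda / M) * sum_i z_i added to the NLL loss.

    z_per_head: learned span parameters z_i, one per attention head.
    lam: regularization strength lambda.
    """
    M = len(z_per_head)
    return (lam / M) * np.sum(z_per_head)

nll = 2.31                               # -log P(w_1, ..., w_T) (illustrative)
z = np.array([12.0, 48.0, 3.0, 96.0])    # hypothetical spans for M = 4 heads
loss = nll + span_penalty(z, lam=0.01)   # 2.31 + 0.3975
```

Because the penalty grows linearly with each z_i, gradient descent trades a small increase in negative log-likelihood for shorter spans, keeping most heads cheap while letting a few heads keep long context when it genuinely helps.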

Adaptive Masking is a powerful attention mechanism in machine learning that allows a model to learn its own context size to attend over. By combining a soft masking function with L1 regularization on the span parameters, Adaptive Masking reduces overfitting and computation while preserving model performance. The technique is useful in natural language processing and other areas where attention mechanisms must handle long sequences.
