Location Sensitive Attention

Location Sensitive Attention: An Overview

Location Sensitive Attention is a mechanism that extends the additive attention mechanism to use cumulative attention weights from previous decoder time steps as an additional feature. This allows the model to move forward consistently through the input, mitigating potential failure modes where some subsequences are repeated or ignored by the decoder.

The attention mechanism is a critical component of sequence-to-sequence models, enabling the model to focus on different parts of the input sequence as it generates output. In the original additive attention mechanism, a sequential representation from a BiRNN encoder and the previous state of a recurrent decoder are combined to produce alignment scores for each position in the input sequence. These scores are then used to weight the encoder outputs into a context vector that is passed to the decoder.

Additive Attention

Additive attention is the basis for location sensitive attention. It calculates a score for each position in the input sequence based on its compatibility with the previous decoder state. The scores are then normalized and used to weight the encoder outputs before they are passed to the decoder. Mathematically, this can be represented as follows:

$ e\_{i,j} = w^{T}\tanh\left(Ws\_{i-1} + Vh\_{j} + b\right) $

In this equation, $h$ represents the sequential representation of the input sequence produced by a BiRNN encoder, with $h\_{j}$ the encoding at position $j$, and $s\_{i-1}$ represents the previous state of the recurrent decoder (an LSTM or GRU). The vectors $w$ and $b$ are learned parameter vectors, while $W$ and $V$ are learned weight matrices. The $\tanh$ function applies a non-linear transformation.

This scoring mechanism produces an energy $e\_{i,j}$ for each position in the input sequence, which is normalized into an alignment weight and used to weight the corresponding encoder output before the result is passed to the decoder.
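
As a concrete illustration, here is a minimal PyTorch sketch of this additive scoring step. The class name, layer dimensions, and the use of a softmax to normalize the energies are assumptions for illustration, not details specified above:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AdditiveAttention(nn.Module):
    """Minimal sketch of additive attention scoring (assumed dimensions)."""
    def __init__(self, enc_dim, dec_dim, attn_dim):
        super().__init__()
        self.W = nn.Linear(dec_dim, attn_dim, bias=True)   # projects s_{i-1}; bias plays the role of b
        self.V = nn.Linear(enc_dim, attn_dim, bias=False)  # projects h_j
        self.w = nn.Linear(attn_dim, 1, bias=False)        # row vector w^T

    def forward(self, s_prev, h):
        # s_prev: (batch, dec_dim)   previous decoder state s_{i-1}
        # h:      (batch, T, enc_dim) BiRNN encoder outputs h_j
        # e_{i,j} = w^T tanh(W s_{i-1} + V h_j + b)
        energies = self.w(torch.tanh(self.W(s_prev).unsqueeze(1) + self.V(h))).squeeze(-1)  # (batch, T)
        alpha = F.softmax(energies, dim=-1)                      # alignment weights
        context = torch.bmm(alpha.unsqueeze(1), h).squeeze(1)    # weighted sum of encoder outputs
        return context, alpha
```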

Extending Additive Attention to Location Sensitive Attention

The goal of location sensitive attention is to enable the model to track its position within the input sequence as it generates output. To achieve this, Location Sensitive Attention uses cumulative attention weights from previous decoder time steps as an additional feature.

This is achieved by extending the additive attention mechanism to take into account the alignment produced at the previous step. Specifically, additional vectors $f\_{i,j}$ are extracted by convolving the previous alignment $\alpha\_{i-1}$ with a matrix $F\in\mathbb{R}^{k\times{r}}$:

$ f\_{i,j} = F \ast \alpha\_{i-1} $
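
To make this convolution step concrete, the snippet below sketches how the previous alignment could be convolved with a bank of filters in PyTorch. The filter count $k$, filter width $r$, padding, and tensor shapes are assumed values for illustration:

```python
import torch
import torch.nn as nn

# Assumed hyperparameters: k filters of width r (values chosen for illustration).
k, r = 32, 31
loc_conv = nn.Conv1d(1, k, kernel_size=r, padding=(r - 1) // 2, bias=False)  # plays the role of F

alpha_prev = torch.softmax(torch.randn(4, 100), dim=-1)   # (batch=4, T=100) previous alignment
f = loc_conv(alpha_prev.unsqueeze(1))                     # (batch, k, T): location features
f = f.transpose(1, 2)                                     # (batch, T, k): f_{i,j} is the k-dim feature at position j
```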

These vectors are then used by the scoring mechanism to update the alignment score for each position in the input sequence:

$ e\_{i,j} = w^{T}\tanh\left(Ws\_{i-1} + Vh\_{j} + Uf\_{i,j} + b\right) $

In this equation, the additional matrix $U$ weights the location features $f\_{i,j}$. The resulting alignment scores are normalized and used to weight the encoder outputs at the current decoding step.
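
Putting the pieces together, the following PyTorch sketch extends the additive scorer above with the $Uf\_{i,j}$ term. The class and argument names are hypothetical, and the softmax normalization and default filter sizes are assumptions rather than details fixed by the formulation above:

```python
import torch
import torch.nn as nn

class LocationSensitiveAttention(nn.Module):
    """Sketch of location sensitive attention (assumed dimensions and filter sizes)."""
    def __init__(self, enc_dim, dec_dim, attn_dim, k=32, r=31):
        super().__init__()
        self.W = nn.Linear(dec_dim, attn_dim, bias=True)    # projects s_{i-1}; bias plays the role of b
        self.V = nn.Linear(enc_dim, attn_dim, bias=False)   # projects h_j
        self.U = nn.Linear(k, attn_dim, bias=False)         # weights the location features f_{i,j}
        self.w = nn.Linear(attn_dim, 1, bias=False)         # row vector w^T
        self.loc_conv = nn.Conv1d(1, k, kernel_size=r, padding=(r - 1) // 2, bias=False)  # F

    def forward(self, s_prev, h, alpha_prev):
        # s_prev:     (batch, dec_dim)    previous decoder state s_{i-1}
        # h:          (batch, T, enc_dim) BiRNN encoder outputs h_j
        # alpha_prev: (batch, T)          previous (or cumulative) alignment alpha_{i-1}
        f = self.loc_conv(alpha_prev.unsqueeze(1)).transpose(1, 2)   # (batch, T, k) location features
        # e_{i,j} = w^T tanh(W s_{i-1} + V h_j + U f_{i,j} + b)
        energies = self.w(torch.tanh(
            self.W(s_prev).unsqueeze(1) + self.V(h) + self.U(f)
        )).squeeze(-1)                                               # (batch, T)
        alpha = torch.softmax(energies, dim=-1)                      # new alignment
        context = torch.bmm(alpha.unsqueeze(1), h).squeeze(1)        # context vector for the decoder
        return context, alpha
```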

Location Sensitive Attention is a powerful mechanism for enabling sequence-to-sequence models to track their position within an input sequence as they generate output. By using cumulative attention weights from previous decoder time steps as an additional feature, the model is encouraged to move forward consistently through the input, mitigating potential failure modes where some subsequences are repeated or ignored by the decoder.
