Deformable Attention Module

In the world of deep learning, the Deformable Attention Module addresses one of the biggest challenges of applying Transformer attention to images. Standard Transformer attention looks over all possible spatial locations, which leads to slow convergence and limits feature spatial resolution. By restricting attention to a small set of sampled locations, the Deformable Attention Module resolves these issues and improves the Transformer's efficiency.

What is the Deformable Attention Module?

The Deformable Attention Module is a core component of the Deformable DETR architecture. It attends only to a small set of key sampling points around a reference point, regardless of the spatial size of the feature maps. This is in contrast to standard Transformer attention, which looks over all possible spatial locations and therefore suffers from slow convergence and limited feature spatial resolution.

The Deformable Attention Module borrows the idea behind deformable convolution: each query attends to only a small, fixed number of keys. This reduced focus improves the overall efficiency of the model. By limiting the attention map to a few key points, the model converges more easily on the needed features while maintaining spatial resolution.
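To get a rough sense of the savings, consider a single query on a $100 \times 100$ feature map (the numbers here are purely illustrative, and $M = 8$ heads with $K = 4$ points per head are the defaults reported for Deformable DETR rather than values given above). Standard attention compares the query against every spatial location, while deformable attention samples only $M K$ points:

$$H W = 100 \times 100 = 10\,000 \quad \text{versus} \quad M K = 8 \times 4 = 32$$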

How Does it Work?

The Deformable Attention Module operates on two main inputs: the content feature of a query element and the reference point of that query element in 2D space. For each attention head, a small number of sampling offsets and their corresponding attention weights are predicted from the query feature via linear projection; the attention weights are normalized with a softmax, and the features sampled at the offset locations are combined into the output.
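As a concrete illustration of these projections, the sketch below sets up the layers for a single-scale module in PyTorch. The class name, argument names, and default sizes are assumptions made for this example, not the reference Deformable DETR implementation; a possible forward pass is shown after the formula below.

```python
import torch
import torch.nn as nn

class DeformableAttentionSketch(nn.Module):
    """Minimal single-scale deformable attention sketch (illustrative, not the
    reference Deformable DETR code). M = attention heads, K = points per head."""

    def __init__(self, d_model=256, n_heads=8, n_points=4):
        super().__init__()
        assert d_model % n_heads == 0
        self.d_model, self.M, self.K = d_model, n_heads, n_points
        # 2MK channels: an (x, y) sampling offset per head and per sampled point
        self.sampling_offsets = nn.Linear(d_model, n_heads * n_points * 2)
        # MK channels: one attention weight per head and per point (softmax-normalized)
        self.attention_weights = nn.Linear(d_model, n_heads * n_points)
        # Value and output projections (W'_m and W_m in the formula below)
        self.value_proj = nn.Linear(d_model, d_model)
        self.output_proj = nn.Linear(d_model, d_model)
```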

Mathematically speaking, the Deformable Attention Feature can be represented using the following formula:

$$\text{DeformAttn}\left(\mathbf{z}_{q}, \mathbf{p}_{q}, \mathbf{x}\right)=\sum_{m=1}^{M} \mathbf{W}_{m}\left[\sum_{k=1}^{K} A_{m q k} \cdot \mathbf{W}_{m}^{\prime} \mathbf{x}\left(\mathbf{p}_{q}+\Delta \mathbf{p}_{m q k}\right)\right]$$

Here, $m$ indexes the attention head, $k$ indexes the sampled keys, and $K$ is the total number of sampled keys $(K \ll H W)$. $\Delta \mathbf{p}_{m q k}$ and $A_{m q k}$ denote the sampling offset and attention weight of the $k$-th sampling point in the $m$-th head, and the sampling location $\mathbf{p}_{q}+\Delta \mathbf{p}_{m q k}$ is fractional.

Since the sampling location is fractional, bilinear interpolation is applied to compute $\mathbf{x}\left(\mathbf{p}_{q}+\Delta \mathbf{p}_{m q k}\right)$. Both $\Delta \mathbf{p}_{m q k}$ and $A_{m q k}$ are obtained via linear projection over the query feature $\mathbf{z}_{q}$: the query feature is fed to a linear projection operator of $3 M K$ channels, where the first $2 M K$ channels encode the sampling offsets $\Delta \mathbf{p}_{m q k}$ and the remaining $M K$ channels are passed through a softmax operator that produces the attention weights $A_{m q k}$.
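Continuing the sketch from above, one possible forward pass is shown below, using `torch.nn.functional.grid_sample` for the bilinear interpolation. The helper function name is hypothetical, and treating the predicted offsets as fractions of the feature-map size (with reference points given in (x, y) order, normalized to [0, 1]) is a simplifying assumption of this sketch rather than a detail stated in the text.

```python
import torch
import torch.nn.functional as F

def deformable_attn_forward(module, z_q, ref_points, x):
    """Forward pass for the DeformableAttentionSketch defined earlier.

    z_q:        (N, Lq, C)   query content features
    ref_points: (N, Lq, 2)   reference points, (x, y) normalized to [0, 1]
    x:          (N, C, H, W) input feature map
    """
    N, Lq, C = z_q.shape
    H, W = x.shape[-2:]
    M, K, C_h = module.M, module.K, C // module.M

    # W'_m x: project the values and split the channels across the M heads
    value = module.value_proj(x.flatten(2).transpose(1, 2))      # (N, HW, C)
    value = value.transpose(1, 2).reshape(N * M, C_h, H, W)

    # Linear projection of 3MK channels: 2MK offsets + MK softmax-normalized weights
    offsets = module.sampling_offsets(z_q).view(N, Lq, M, K, 2)
    weights = module.attention_weights(z_q).view(N, Lq, M, K).softmax(-1)

    # p_q + Δp_mqk (offsets treated as fractions of the map size in this sketch)
    locs = ref_points[:, :, None, None, :] + offsets             # (N, Lq, M, K, 2)
    grid = (2.0 * locs - 1.0).permute(0, 2, 1, 3, 4)             # grid_sample expects [-1, 1]
    grid = grid.reshape(N * M, Lq, K, 2)

    # Bilinear interpolation of x(p_q + Δp_mqk) at the fractional locations
    sampled = F.grid_sample(value, grid, mode="bilinear",
                            padding_mode="zeros", align_corners=False)  # (N*M, C_h, Lq, K)

    # Σ_k A_mqk · x(p_q + Δp_mqk), then the output projection combines the heads
    weights = weights.permute(0, 2, 1, 3).reshape(N * M, 1, Lq, K)
    out = (sampled * weights).sum(-1)                            # (N*M, C_h, Lq)
    out = out.reshape(N, M * C_h, Lq).transpose(1, 2)            # (N, Lq, C)
    return module.output_proj(out)

# Example usage with random inputs
attn = DeformableAttentionSketch()
z_q = torch.randn(2, 300, 256)        # 300 queries
p_q = torch.rand(2, 300, 2)           # reference points in [0, 1]
feat = torch.randn(2, 256, 64, 64)    # feature map
out = deformable_attn_forward(attn, z_q, p_q, feat)   # (2, 300, 256)
```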

Applications

The Deformable Attention Module has been used in various deep learning applications, including object detection, semantic segmentation, and face alignment. Its most common application is object detection, where it has been shown to improve model accuracy.

In the context of object detection, the Deformable Attention Module localizes objects while searching for their features. Its multi-scale variant samples points around each reference point across feature maps of several resolutions, which helps the model pick out the relevant features in the image. This enhances the model's ability to recognize complex objects and improves accuracy under occlusion, image distortion, and scale variation.

The Deformable Attention Module is a powerful tool in deep learning that addresses some of the most challenging issues in Transformer attention. By focusing on only a few key sampling points and using a deformable attention map, it improves convergence, feature spatial resolution, and accuracy, while reducing computation.
