Dual Attention Network

DANet: A Framework for Natural Scene Image Segmentation

DANet (Dual Attention Network) is a framework proposed by Fu et al. for natural scene image segmentation. Scene segmentation involves identifying the different objects in an image and partitioning the image into the corresponding regions. Traditional encoder-decoder structures do not exploit the global relationships between objects, while RNN-based structures depend heavily on long-term memorization. DANet addresses this by using self-attention, rather than simply stacking convolutions, to compute attention maps over both the spatial and channel dimensions.

How DANet Works

DANet uses two modules, a position attention module and a channel attention module, to capture feature dependencies in the spatial and channel domains, respectively. In the position attention module, convolution layers are first applied to the input feature map $X$ to obtain new feature maps. The module then aggregates the features at each position as a weighted sum of the features at all positions, where the weights are determined by the similarity between the corresponding pairs of positions. The channel attention module has the same form, except that its attention map is computed directly from the original features in order to model the relations between channels. Finally, the outputs of the two branches are fused to obtain the final feature representation.

For simplicity, the feature map $X$ is reshaped to $C\times (H \times W)$, whereupon the overall process can be written as:

$$Q,\; K,\; V \;=\; W_q X,\; W_k X,\; W_v X$$

$$Y^{\mathrm{pos}} = X + V\,\mathrm{Softmax}(Q^\top K)$$

$$Y^{\mathrm{chn}} = X + \mathrm{Softmax}(X X^\top)\,X$$

$$Y = Y^{\mathrm{pos}} + Y^{\mathrm{chn}}$$

Here, $W_q$, $W_k$, $W_v \in \mathbb{R}^{C\times C}$ are used to generate new feature maps. The position attention module enables DANet to capture long-range contextual information and adaptively integrate similar features at any scale from a global viewpoint. The channel attention module is responsible for enhancing useful channels as well as suppressing noise. By taking spatial and channel relationships into consideration explicitly, DANet improves the feature representation for scene segmentation.
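
The simplified equations above translate almost line for line into code. The following is a minimal PyTorch sketch of the two branches, assuming that $W_q$, $W_k$, $W_v$ are implemented as $1 \times 1$ convolutions; the class name `SimplifiedDualAttention` and the choice of softmax axes are illustrative assumptions, and the learnable scaling factors and other details of the official DANet implementation are omitted.

```python
# Minimal sketch of the simplified dual-attention equations (assumptions noted above).
import torch
import torch.nn as nn
import torch.nn.functional as F


class SimplifiedDualAttention(nn.Module):
    """Position + channel attention following the simplified equations for Y^pos, Y^chn, Y."""

    def __init__(self, channels: int):
        super().__init__()
        # W_q, W_k, W_v in R^{C x C}, implemented here as 1x1 convolutions (an assumption).
        self.w_q = nn.Conv2d(channels, channels, kernel_size=1)
        self.w_k = nn.Conv2d(channels, channels, kernel_size=1)
        self.w_v = nn.Conv2d(channels, channels, kernel_size=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, h, w = x.shape
        x_flat = x.view(b, c, h * w)                          # X reshaped to C x (HW)

        # Position attention branch: Y^pos = X + V Softmax(Q^T K)
        q = self.w_q(x).view(b, c, h * w)
        k = self.w_k(x).view(b, c, h * w)
        v = self.w_v(x).view(b, c, h * w)
        sim = q.transpose(1, 2) @ k                           # (HW) x (HW) position similarities
        # Normalize over the index that V is summed against, so each output
        # position is a weighted sum of features (one reading of the equations).
        pos_attn = F.softmax(sim, dim=-2)
        y_pos = x_flat + v @ pos_attn

        # Channel attention branch: Y^chn = X + Softmax(X X^T) X
        chn_attn = F.softmax(x_flat @ x_flat.transpose(1, 2), dim=-1)  # C x C channel similarities
        y_chn = x_flat + chn_attn @ x_flat

        # Fuse the two branches and restore the spatial layout: Y = Y^pos + Y^chn
        return (y_pos + y_chn).view(b, c, h, w)
```

For example, `SimplifiedDualAttention(64)(torch.randn(2, 64, 32, 32))` returns a tensor of the same shape, so a module of this kind can sit on top of any backbone feature map.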

Although DANet produces better results than traditional methods, it is computationally costly, especially for large input feature maps: the position attention map alone has $(H \times W) \times (H \times W)$ entries, so its cost grows quadratically with the number of spatial positions.
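
To make this concrete (the feature-map size below is an illustrative assumption, not a figure from the paper): for $C = 512$ channels and a spatial size of $H \times W = 64 \times 64$, the position attention map has $(HW)^2 = 4096^2 \approx 1.7 \times 10^7$ entries, and forming $Q^\top K$ alone takes on the order of $C\,(HW)^2 \approx 8.6 \times 10^9$ multiply-adds per image.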

In summary, DANet improves natural scene image segmentation by combining position and channel attention modules that capture feature dependencies in the spatial and channel domains. By explicitly modeling these long-range relationships, it produces stronger feature representations, and better results, than traditional encoder-decoder and RNN-based approaches, at the price of the computational cost discussed above.
