What is Channel-wise Cross Attention?
Channel-wise cross attention is a module used in the UCTransNet architecture for semantic segmentation. It fuses the semantically inconsistent features of the Channel Transformer and the U-Net decoder, eliminating the ambiguity this inconsistency causes in the decoder features. The operation blends convolutional and transformer components, which work together to improve the model's performance across various tasks.
How does Channel-wise Cross Attention Work?
The module takes the i-th level Transformer output Oi and the i-th level decoder feature map Di as inputs. A global average pooling (GAP) layer is applied to each input, producing a vector with one element per channel that embeds the global spatial information of that channel.
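Concretely, GAP is the standard spatial-average operation: the k-th element of the pooled vector is the mean of the k-th channel of an H × W feature map X:

$$ \mathcal{G}\left(\mathbf{X}\right)\_{k} = \frac{1}{H \times W} \sum\_{i=1}^{H} \sum\_{j=1}^{W} \mathbf{X}\_{k}(i, j) $$

From these pooled vectors, an attention mask is generated using the equation: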
$$ \mathbf{M}\_{i} = \mathbf{L}\_{1} \cdot \mathcal{G}\left(\mathbf{O}\_{i}\right) + \mathbf{L}\_{2} \cdot \mathcal{G}\left(\mathbf{D}\_{i}\right) $$
The equation encodes channel-wise dependencies, where L1 and L2 are the weights of two linear layers and δ(.) is the ReLU operator. Following ECA-Net (Efficient Channel Attention Network), which empirically shows that avoiding dimensionality reduction is important for learning channel attention, a single linear layer and a sigmoid function are used to build the channel attention map. The resulting vector recalibrates, or excites, Oi to
$$ \mathbf{\bar{O}}\_{i} = \sigma\left(\mathbf{M}\_{i}\right) \cdot \mathbf{O}\_{i} $$
where the activation σ(Mi) denotes the importance of each channel. The masked Oi is concatenated with the up-sampled features of the i-th level decoder to obtain the output.
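To make the two equations above concrete, here is a minimal PyTorch sketch of the mechanism. The class and argument names (`ChannelWiseCrossAttention`, `channels_o`, `channels_d`) are illustrative rather than taken from the UCTransNet code, the ReLU operator δ(.) is omitted, and the concatenation with the up-sampled decoder features happens outside the module.

```python
import torch
import torch.nn as nn


class ChannelWiseCrossAttention(nn.Module):
    """Minimal sketch of channel-wise cross attention (names are illustrative).

    Recalibrates the i-th level Transformer output O_i using channel-wise
    dependencies gathered from both O_i and the i-th level decoder map D_i.
    """

    def __init__(self, channels_o: int, channels_d: int):
        super().__init__()
        self.gap = nn.AdaptiveAvgPool2d(1)                 # G(.): global average pooling
        # One linear layer per input, with no dimensionality reduction (as in ECA-Net)
        self.linear_o = nn.Linear(channels_o, channels_o)  # L1
        self.linear_d = nn.Linear(channels_d, channels_o)  # L2

    def forward(self, o_i: torch.Tensor, d_i: torch.Tensor) -> torch.Tensor:
        b, c_o, _, _ = o_i.shape
        g_o = self.gap(o_i).flatten(1)                     # G(O_i): shape (B, C_o)
        g_d = self.gap(d_i).flatten(1)                     # G(D_i): shape (B, C_d)
        mask = self.linear_o(g_o) + self.linear_d(g_d)     # M_i = L1·G(O_i) + L2·G(D_i)
        scale = torch.sigmoid(mask).view(b, c_o, 1, 1)     # σ(M_i): per-channel importance
        return o_i * scale                                 # O_bar_i = σ(M_i) · O_i


# Hypothetical shapes: a 64-channel Transformer output and a 128-channel decoder map.
cca = ChannelWiseCrossAttention(channels_o=64, channels_d=128)
o_bar = cca(torch.randn(1, 64, 32, 32), torch.randn(1, 128, 32, 32))
print(o_bar.shape)  # torch.Size([1, 64, 32, 32])
```

The recalibrated output would then be concatenated with the up-sampled decoder features (for example, `torch.cat([o_bar, upsampled_d], dim=1)`) before the next decoder stage.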
Benefits of Channel-wise Cross Attention
Channel-wise cross attention has several benefits, including:
- Better feature fusion: It helps to fuse features of inconsistent semantics between the Channel Transformer and U-Net decoder, which eliminates ambiguity in the decoder's features.
- Improved model performance: Fusing these complementary features improves the model's performance across various tasks.
- Efficient use of channel-wise dependencies: The channel attention map generated by the module encodes channel-wise dependencies, which help to recalibrate or excite the transformer features accordingly.
- Generalizability: The module works well for various segmentation tasks, making it a flexible and generalizable solution.
Applications of Channel-wise Cross Attention
Channel-wise cross attention has several applications, including:
- Medical image segmentation: The module works well for medical image segmentation tasks and has been used in the segmentation of liver tumors and brain tumors, among others.
- Object detection: The module has also been applied in object detection tasks, where it helps to fuse features from the encoder and decoder networks, leading to improved model performance.
- Image recognition: The channel-wise cross-attention module has been used in image recognition tasks such as image classification and image localization.
Channel-wise cross attention is a powerful module for semantic segmentation tasks. It fuses features of inconsistent semantics between the Channel Transformer and U-Net decoder, which eliminates ambiguity in the decoder's features. Recalibrating, or exciting, the transformer features according to channel-wise dependencies improves model performance, making channel-wise cross attention a versatile and effective tool for various segmentation tasks.