What is Channel-wise Cross Attention?
Channel-wise cross attention is a module used in the UCTransNet architecture for semantic segmentation. It fuses the semantically inconsistent features of the Channel Transformer and the U-Net decoder, eliminating the ambiguity this inconsistency causes in the decoder features. The operation blends convolutional and transformer components, which work together to improve the model's performance across various tasks.
How does Channel-wise Cross Attention Work?
The module takes the i-th level Transformer output Oi and the i-th level decoder feature map Di as inputs. A global average pooling (GAP) layer is applied to each input, producing a vector with one element per channel that embeds the global spatial information of that channel.
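Concretely, GAP is the standard spatial-average operation: the k-th element of the pooled vector is the mean of the k-th channel of an H × W feature map X:

$$ \mathcal{G}\left(\mathbf{X}\right)\_{k} = \frac{1}{H \times W} \sum\_{i=1}^{H} \sum\_{j=1}^{W} \mathbf{X}\_{k}(i, j) $$

From these pooled vectors, an attention mask is generated using the equation: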
$$ \mathbf{M}\_{i} = \mathbf{L}\_{1} \cdot \mathcal{G}\left(\mathbf{O}\_{i}\right) + \mathbf{L}\_{2} \cdot \mathcal{G}\left(\mathbf{D}\_{i}\right) $$
The equation encodes channel-wise dependencies, where L1 and L2 are the weights of two linear layers and δ(.) is the ReLU operator. Following ECA-Net (Efficient Channel Attention Network), which empirically shows that avoiding dimensionality reduction is important for learning channel attention, a single linear layer and a sigmoid function are used to build the channel attention map. The resulting vector recalibrates, or excites, Oi to
$$ \mathbf{\bar{O}}\_{i} = \sigma\left(\mathbf{M}\_{i}\right) \cdot \mathbf{O}\_{i} $$
where the activation σ(Mi) denotes the importance of each channel. The masked Oi is concatenated with the up-sampled features of the i-th level decoder to obtain the output.
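To make the two equations above concrete, here is a minimal PyTorch sketch of the mechanism. The class and argument names (`ChannelWiseCrossAttention`, `channels_o`, `channels_d`) are illustrative rather than taken from the UCTransNet code, the ReLU operator δ(.) is omitted, and the concatenation with the up-sampled decoder features happens outside the module.

```python
import torch
import torch.nn as nn


class ChannelWiseCrossAttention(nn.Module):
    """Minimal sketch of channel-wise cross attention (names are illustrative).

    Recalibrates the i-th level Transformer output O_i using channel-wise
    dependencies gathered from both O_i and the i-th level decoder map D_i.
    """

    def __init__(self, channels_o: int, channels_d: int):
        super().__init__()
        self.gap = nn.AdaptiveAvgPool2d(1)                 # G(.): global average pooling
        # One linear layer per input, with no dimensionality reduction (as in ECA-Net)
        self.linear_o = nn.Linear(channels_o, channels_o)  # L1
        self.linear_d = nn.Linear(channels_d, channels_o)  # L2

    def forward(self, o_i: torch.Tensor, d_i: torch.Tensor) -> torch.Tensor:
        b, c_o, _, _ = o_i.shape
        g_o = self.gap(o_i).flatten(1)                     # G(O_i): shape (B, C_o)
        g_d = self.gap(d_i).flatten(1)                     # G(D_i): shape (B, C_d)
        mask = self.linear_o(g_o) + self.linear_d(g_d)     # M_i = L1·G(O_i) + L2·G(D_i)
        scale = torch.sigmoid(mask).view(b, c_o, 1, 1)     # σ(M_i): per-channel importance
        return o_i * scale                                 # O_bar_i = σ(M_i) · O_i


# Hypothetical shapes: a 64-channel Transformer output and a 128-channel decoder map.
cca = ChannelWiseCrossAttention(channels_o=64, channels_d=128)
o_bar = cca(torch.randn(1, 64, 32, 32), torch.randn(1, 128, 32, 32))
print(o_bar.shape)  # torch.Size([1, 64, 32, 32])
```

The recalibrated output would then be concatenated with the up-sampled decoder features (for example, `torch.cat([o_bar, upsampled_d], dim=1)`) before the next decoder stage.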
Benefits of Channel-wise Cross Attention
Channel-wise cross attention has several benefits, including:
- Better feature fusion: It helps to fuse features of inconsistent semantics between the Channel Transformer and U-Net decoder, which eliminates ambiguity in the decoder's features.
- Improved model performance: Fusing these complementary features improves the model's performance across various tasks.
- Efficient use of channel-wise dependencies: The channel attention map generated by the module encodes channel-wise dependencies, which help to recalibrate or excite the transformer features accordingly.
- Generalizability: The module works well for various segmentation tasks, making it a flexible and generalizable solution.
Applications of Channel-wise Cross Attention
Channel-wise cross attention has several applications, including:
- Medical image segmentation: The module works well for medical image segmentation tasks and has been used in the segmentation of liver tumors and brain tumors, among others.
- Object detection: The module has also been applied in object detection tasks, where it helps to fuse features from the encoder and decoder networks, leading to improved model performance.
- Image recognition: The channel-wise cross-attention module has been used in image recognition tasks such as image classification and image localization.
Channel-wise cross attention is a powerful module for semantic segmentation tasks. It fuses features of inconsistent semantics between the Channel Transformer and U-Net decoder, which eliminates ambiguity in the decoder's features. Recalibrating, or exciting, the transformer features according to channel-wise dependencies improves model performance, making channel-wise cross attention a versatile and effective tool for various segmentation tasks.