Channel Squeeze and Spatial Excitation (sSE)

Channel Squeeze and Spatial Excitation: Enhancing Image Segmentation

One of the challenges in computer vision is to accurately segment images, that is, to partition them into regions and identify the objects they contain. Convolutional neural networks (CNNs) have been widely used for this task, achieving impressive results on various datasets. However, as these models become deeper and more complex, they can suffer from vanishing gradients and poor feature propagation, which reduces segmentation quality. One way to improve the learned features is to add lightweight blocks that recalibrate the feature maps, emphasizing informative channels or spatial locations and suppressing less useful ones, for more fine-grained segmentation.

The SE Block: Spatial Squeeze and Channel Excitation

The spatial squeeze and channel excitation (SE) block is a widely used design for CNNs. It consists of two phases. The first phase, called spatial squeeze, collapses the spatial dimensions of each feature map into a single descriptor per channel. The second phase, called channel excitation, recalibrates the feature maps by modeling channel interdependencies, i.e., how the different channels of a feature map relate to each other. This lets the SE block emphasize informative channels and suppress less useful ones, which helps the network capture the complex patterns present in images and segment them more accurately.

The spatial squeeze phase performs global average pooling (GAP) over the spatial dimensions of each channel, which reduces each H × W feature map to a single scalar. For an input with C channels, the result is a vector of C scalars that summarizes the global response of each channel. This vector is not yet a set of weights; it is fed into the channel excitation phase, which turns it into one.
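As a minimal sketch of the squeeze step, global average pooling over the spatial dimensions can be written in NumPy as follows. The (C, H, W) array layout, the function name squeeze_spatial, and the toy input values are assumptions made for illustration only.

```python
import numpy as np

def squeeze_spatial(x):
    """Global average pooling: (C, H, W) feature maps -> (C,) channel descriptors."""
    return x.mean(axis=(1, 2))

# Toy input: 3 channels of size 2 x 2 (illustrative values, not real features).
x = np.arange(12, dtype=np.float64).reshape(3, 2, 2)
z = squeeze_spatial(x)

print(z.shape)  # (3,)
print(z)        # [1.5 5.5 9.5]
```

Each entry of z is the mean of one channel, so the spatial layout is discarded and only one number per channel remains.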

The channel excitation phase computes one weight for each channel of the feature maps. This is done by applying two fully connected layers to the squeezed vector, followed by the sigmoid activation function, which results in a set of values between 0 and 1. These values represent the importance of each channel, and each one is multiplied with every spatial location of its channel in the original feature maps, resulting in feature maps that are recalibrated and enhanced. An SE block can be inserted after any convolutional stage, so this recalibration is typically repeated throughout the CNN.
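The complete SE recalibration can be sketched as a NumPy forward pass. This is an illustration under stated assumptions rather than a reference implementation: the ReLU between the two fully connected layers and the reduction ratio r follow the commonly used SE formulation, and the function name se_block, the (C, H, W) layout, and the random weights are hypothetical.

```python
import numpy as np

def sigmoid(v):
    return 1.0 / (1.0 + np.exp(-v))

def se_block(x, w1, b1, w2, b2):
    """SE forward pass for x of shape (C, H, W).

    w1: (C // r, C) and w2: (C, C // r) are the two fully connected layers
    (r is an assumed reduction ratio).
    """
    z = x.mean(axis=(1, 2))            # spatial squeeze: (C,)
    h = np.maximum(w1 @ z + b1, 0.0)   # first FC layer + ReLU (assumed): (C // r,)
    s = sigmoid(w2 @ h + b2)           # second FC layer + sigmoid: (C,), values in (0, 1)
    y = x * s[:, None, None]           # scale every location of channel c by s[c]
    return y, s

# Random weights stand in for trained parameters.
rng = np.random.default_rng(0)
C, H, W, r = 8, 4, 4, 2
x = rng.standard_normal((C, H, W))
w1 = rng.standard_normal((C // r, C)) * 0.1
b1 = np.zeros(C // r)
w2 = rng.standard_normal((C, C // r)) * 0.1
b2 = np.zeros(C)

y, s = se_block(x, w1, b1, w2, b2)
print(y.shape, s.shape)  # (8, 4, 4) (8,)
```

The output has the same shape as the input, and all locations within one channel share the same weight.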

The sSE Block: Channel Squeeze and Spatial Excitation

While the SE block has been shown to be effective, it adds two fully connected layers per block, a cost that accumulates in deeper and wider CNNs where many SE blocks are used. It also recalibrates channels only: every spatial location within a channel receives the same weight, even though segmentation depends heavily on pixel-level spatial information. Inspired by SE, researchers proposed the channel squeeze and spatial excitation (sSE) block, which squeezes along the channel dimension and excites spatially.

The sSE block applies a convolutional layer with a 1 × 1 kernel and a single output channel to the input feature maps, followed by a sigmoid. The convolution is the channel squeeze, and the sigmoid output is the spatial excitation. This enables the sSE block to weight each spatial location individually. Because its only learned parameters are those of the 1 × 1 convolution, it also needs far fewer parameters than an SE block.
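A rough parameter count illustrates the efficiency argument. The sketch below assumes an SE block with two fully connected layers (C to C/r and back, with biases) and an sSE block with a single 1 × 1 convolution from C channels to one channel (with bias). The function names and the reduction ratio r = 16 are assumptions chosen for illustration.

```python
def se_params(c, r):
    """Parameters of two FC layers with biases: C -> C // r -> C."""
    hidden = c // r
    return (c * hidden + hidden) + (hidden * c + c)

def sse_params(c):
    """Parameters of one 1 x 1 convolution from C channels to 1 channel, with bias."""
    return c + 1

for c in (64, 256):
    print(c, se_params(c, r=16), sse_params(c))
# 64 580 65
# 256 8464 257
```

Under these assumptions the SE count grows quadratically with the number of channels, while the sSE count grows linearly.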

The channel squeeze phase in the sSE block is analogous to the spatial squeeze phase in the SE block, but it collapses the channel dimension instead of the spatial dimensions, and it uses a learned 1 × 1 convolution rather than GAP. At every spatial location, the C channel values are combined into a single value, turning a C × H × W input into one H × W map.

The spatial excitation phase in the sSE block computes a weight for each spatial location, whereas the SE block computes a weight for each channel. Instead of fully connected layers, it passes the squeezed H × W map through a sigmoid, giving values between 0 and 1. Each value is multiplied with all channels at its location in the input, so important locations are emphasized and less relevant ones are suppressed. Because the 1 × 1 convolution looks at one location at a time, each weight depends only on the features at that location.
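A minimal NumPy sketch of an sSE forward pass follows. It assumes that the squeeze is a 1 × 1 convolution with a single output channel, which amounts to a weighted sum over channels at each pixel, followed by a sigmoid. The function name sse_block, the (C, H, W) layout, and the random weights are hypothetical.

```python
import numpy as np

def sigmoid(v):
    return 1.0 / (1.0 + np.exp(-v))

def sse_block(x, w, b=0.0):
    """sSE forward pass for x of shape (C, H, W).

    w: (C,) weights of a 1 x 1 convolution with one output channel (assumed form).
    """
    q = np.tensordot(w, x, axes=([0], [0])) + b  # channel squeeze: (H, W)
    a = sigmoid(q)                               # spatial excitation map in (0, 1)
    y = x * a[None, :, :]                        # scale all channels at each location
    return y, a

rng = np.random.default_rng(0)
C, H, W = 8, 4, 5
x = rng.standard_normal((C, H, W))
w = rng.standard_normal(C) * 0.1   # random weights stand in for trained ones

y, a = sse_block(x, w)
print(y.shape, a.shape)  # (8, 4, 5) (4, 5)

# Sanity check: with all-zero weights the map is 0.5 everywhere,
# so the block simply halves the input.
y0, a0 = sse_block(x, np.zeros(C))
print(np.allclose(y0, 0.5 * x))  # True
```

In contrast to the SE block, the weight here varies from pixel to pixel but is shared by all channels at that pixel.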

The sSE block has been shown to perform well in various image segmentation tasks, including medical imaging, urban scene understanding, and natural image segmentation. It has also been used to enhance the feature maps in CNNs for other computer vision tasks, such as object detection and recognition.

Efficient CNN design is critical for accurate image segmentation. The SE and sSE blocks are two complementary recalibration designs: SE performs spatial squeeze and channel excitation, while sSE performs channel squeeze and spatial excitation. Both help CNNs capture the complex patterns present in images and segment them more accurately. While the SE block is effective, the sSE block is a lighter-weight alternative with fewer parameters, making it an attractive choice for a range of computer vision tasks.