Global second-order pooling convolutional networks

GSoP-Net Overview: Modeling High-Order Statistics and Gathering Global Information

GSoP-Net is a convolutional network architecture built around the GSoP block, which consists of a squeeze module and an excitation module. The GSoP block uses second-order (covariance) pooling to model high-order statistics and gather global information from the entire feature map. The architecture has shown strong results on computer vision tasks such as image classification and object detection.

The Squeeze Module

The squeeze module of a GSoP block first reduces the number of channels of the input features from c to a smaller c' using a 1x1 convolution. This reduction is necessary because computing pairwise statistics over high-dimensional feature maps is computationally expensive and can hinder learning.
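Per spatial position, a 1x1 convolution is just a linear map over channels. A minimal NumPy sketch of the channel reduction, assuming a (c, h, w) input tensor and using a random matrix w1 as a stand-in for the learned weights:

```python
import numpy as np

# Sketch of the squeeze step's 1x1 convolution. The weight w1 of
# shape (c', c) is a hypothetical stand-in for a learned parameter.
def reduce_channels(x, w1):
    c, h, w = x.shape
    # A 1x1 convolution is a per-pixel linear map over channels:
    # reshape to (c, h*w), multiply, reshape back to (c', h, w).
    return (w1 @ x.reshape(c, h * w)).reshape(-1, h, w)

rng = np.random.default_rng(0)
x = rng.standard_normal((64, 8, 8))   # c = 64 input channels
w1 = rng.standard_normal((32, 64))    # reduce to c' = 32
y = reduce_channels(x, w1)
print(y.shape)  # (32, 8, 8)
```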

After the reduction, the squeeze module computes a c' x c' covariance matrix over the remaining channels. Each entry of this matrix measures how strongly two channels co-vary across spatial positions. As an intuition, consider an RGB image: its three channels would yield a 3 x 3 covariance matrix describing, for example, how closely the red and green intensities track each other across the image.

Next, normalization is performed on the covariance matrix, which allows the GSoP block to explicitly relate one channel to another on a comparable scale. The result is a normalized covariance matrix of dimension c' x c'. This step helps the network learn high-level features by accounting for the correlations between different channels.
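The covariance and normalization steps can be sketched in NumPy as follows. Each channel is flattened into an h*w-dimensional sample, and the simple row-sum normalization shown here is an illustrative stand-in for the paper's actual normalization scheme:

```python
import numpy as np

# Sketch of the covariance step, assuming reduced features of shape
# (c', h, w). Each channel is flattened into an h*w-dimensional
# sample; the c' x c' covariance captures pairwise channel relations.
def channel_covariance(x):
    cp, h, w = x.shape
    m = x.reshape(cp, h * w)
    m = m - m.mean(axis=1, keepdims=True)   # center each channel
    return (m @ m.T) / (h * w)              # c' x c' covariance

# Illustrative row-wise normalization (a stand-in, not necessarily
# the scheme used by GSoP-Net itself).
def normalize(cov):
    return cov / (np.abs(cov).sum(axis=1, keepdims=True) + 1e-8)

rng = np.random.default_rng(0)
x = rng.standard_normal((32, 8, 8))
cov = normalize(channel_covariance(x))
print(cov.shape)  # (32, 32)
```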

The Excitation Module

The excitation module of a GSoP block performs row-wise convolution on the normalized covariance matrix, preserving its structural information: each row summarizes one channel's correlations with all the others. It then applies a fully-connected layer and a sigmoid function to produce a c-dimensional attention vector, in which each entry encodes the importance of the corresponding feature map.

Finally, the GSoP block multiplies the input features by the attention vector, which leads to a more refined feature representation. This process is similar to the Squeeze-and-Excitation (SE) block, which is a popular network component that has shown promising results on various computer vision tasks.
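The excitation path can be sketched as follows. For brevity, the row-wise convolution is simplified to a single shared weight vector applied to every row of the covariance matrix; w_row and w_fc are hypothetical learned parameters:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Sketch of the excitation step. Each row of the c' x c' covariance
# matrix is mapped through a shared weight vector (a simplified
# row-wise convolution), then a fully-connected layer and a sigmoid
# produce the c-dimensional attention vector.
def excite(cov, w_row, w_fc):
    row_out = cov @ w_row            # one value per row, shape (c',)
    return sigmoid(w_fc @ row_out)   # attention vector, shape (c,)

def gsop_scale(x, attention):
    # Scale each of the c input channels by its attention weight.
    return x * attention[:, None, None]

rng = np.random.default_rng(0)
cov = rng.standard_normal((32, 32))       # normalized covariance
w_row = 0.01 * rng.standard_normal(32)
w_fc = 0.01 * rng.standard_normal((64, 32))
x = rng.standard_normal((64, 8, 8))
s = excite(cov, w_row, w_fc)
y = gsop_scale(x, s)
print(s.shape, y.shape)
```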

The Formulation of a GSoP Block

A GSoP block can be mathematically formulated as:

s = F_gsop(X, θ) = σ(W · RC(Cov(Conv(X))))
Y = s ⊙ X

where Conv(·) reduces the number of channels, Cov(·) computes the covariance matrix of the reduced features, RC(·) performs row-wise convolution, W is the weight of the fully-connected layer, and σ(·) is the sigmoid activation. Multiplying the input X by the attention vector s scales each channel by its learned importance, yielding the refined feature representation Y.
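Putting the pieces together, the formulation above can be traced end to end in a few lines of NumPy. All weights here are random stand-ins for learned parameters, and the row-wise convolution is again reduced to a single shared weight vector for brevity:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# End-to-end sketch of s = sigma(W . RC(Cov(Conv(X)))) and Y = s (.) X.
# w1, w_row, and w_fc are hypothetical stand-ins for learned weights.
def gsop_block(x, w1, w_row, w_fc):
    c, h, w = x.shape
    z = w1 @ x.reshape(c, h * w)          # Conv: reduce c -> c'
    z = z - z.mean(axis=1, keepdims=True)
    cov = (z @ z.T) / (h * w)             # Cov: c' x c'
    s = sigmoid(w_fc @ (cov @ w_row))     # RC, then W and sigma
    return s[:, None, None] * x           # Y = s (.) X, channel-wise

rng = np.random.default_rng(0)
x = rng.standard_normal((64, 8, 8))
out = gsop_block(x,
                 rng.standard_normal((32, 64)),       # w1
                 0.01 * rng.standard_normal(32),      # w_row
                 0.01 * rng.standard_normal((64, 32)))  # w_fc
print(out.shape)  # (64, 8, 8)
```

Note that the output keeps the input's shape, so the block can be dropped into an existing network as a residual-style attention unit.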

GSoP-Net is a promising architecture that effectively models high-order statistics and gathers global information through its GSoP block, composed of a squeeze module and an excitation module. The squeeze module reduces the number of channels, computes a covariance matrix, and normalizes it to explicitly relate channels to one another. The excitation module applies row-wise convolution, a fully-connected layer, and a sigmoid to produce a c-dimensional attention vector, which then rescales the input features channel by channel. The architecture has demonstrated strong performance on a variety of computer vision tasks and has the potential to advance the field of deep learning further.
