1x1 Convolution

Introduction

If you’ve heard about computer vision, convolutional neural networks (CNNs), or deep learning, you may have also come across the term 1x1 convolution. It is a type of convolution that differs from other convolutions, such as 3x3, 5x5, and 7x7, in its properties and uses. In this article, we’ll explore what 1x1 convolution is, how it works, and why it’s important in deep learning.

What is Convolution?

Before we dive into 1x1 convolution, let’s briefly review what convolution is. Convolution is a mathematical operation that involves two functions: a signal and a filter. The signal is usually an image or a set of images, represented as a matrix of pixel values. The filter is a smaller matrix of weights that slides or “convolves” over the signal, performing a dot product at each position and producing an output matrix of convolved values.
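To make the sliding dot product concrete, here is a minimal NumPy sketch of 2D convolution for a single-channel signal and filter (as in most deep learning libraries, the filter is not flipped, so strictly speaking this is cross-correlation); the image and filter values are made up for illustration:

```python
import numpy as np

def conv2d(signal, kernel):
    """Slide a 2D kernel over a 2D signal (stride 1, no padding)."""
    kh, kw = kernel.shape
    oh = signal.shape[0] - kh + 1   # output height
    ow = signal.shape[1] - kw + 1   # output width
    out = np.zeros((oh, ow))
    for i in range(oh):
        for j in range(ow):
            # dot product between the kernel and the patch it covers
            out[i, j] = np.sum(signal[i:i + kh, j:j + kw] * kernel)
    return out

image = np.arange(25.0).reshape(5, 5)        # toy 5x5 "image" (a ramp)
grad_filter = np.array([[1.0, 0.0, -1.0]])   # 1x3 horizontal-gradient filter
print(conv2d(image, grad_filter))            # every entry is -2.0 for this ramp
```

Because the toy image increases by 1 per column, the horizontal-gradient filter produces the same value (-2.0) at every position, which is exactly the kind of local pattern response convolution is used for.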

Convolution can be used for various purposes in image processing and computer vision, such as edge detection, feature extraction, segmentation, and classification. It has also been shown to be effective in deep learning models for image recognition, object detection, semantic segmentation, and other tasks.

What is 1x1 Convolution?

1x1 convolution, also known as pointwise convolution, is a special type of convolution that uses a filter with a size of 1x1. Unlike other convolutions that have larger filter sizes, 1x1 convolution operates on individual pixels of the input image. Specifically, it applies a linear transformation to each pixel independently across its channels.

Let’s consider an example. Suppose we have an input image of size 32x32x64, where 32x32 is the spatial size and 64 is the number of channels. We can apply a 1x1 convolution with 32 output channels, which means that the filter will have a size of 1x1x64x32. The resulting output image will have the same spatial size of 32x32 but with only 32 channels. The 1x1 convolution has essentially reduced the number of channels, or dimensions, of the image from 64 to 32, by applying a linear transformation to each pixel.
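Since each output pixel is just a matrix-vector product over the channel axis, the shape arithmetic above can be checked in a few lines of NumPy (the random values are purely illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal((32, 32, 64))   # input: 32x32 spatial, 64 channels
w = rng.standard_normal((64, 32))       # 1x1 filter bank: 64 in, 32 out channels

# A 1x1 convolution is a per-pixel matrix multiply over the channel axis.
y = (x.reshape(-1, 64) @ w).reshape(32, 32, 32)
print(y.shape)  # (32, 32, 32): spatial size unchanged, channels reduced 64 -> 32
```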

The main advantage of 1x1 convolution is its ability to perform dimensionality reduction in a computationally efficient manner. By reducing the number of channels, we can reduce the amount of computations needed for subsequent convolutions or fully connected layers, which can speed up the training and inference process of deep learning models.
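A rough multiply count illustrates the saving. The channel sizes below (reducing 64 channels to 16 with a 1x1 convolution before a 5x5 convolution) are made-up illustrative numbers, not taken from any specific architecture:

```python
def mults(h, w, k, c_in, c_out):
    """Multiply count for a KxK convolution over an HxW feature map."""
    return h * w * k * k * c_in * c_out

H, W = 32, 32
direct = mults(H, W, 5, 64, 64)                               # 5x5 conv, 64 -> 64 channels
bottleneck = mults(H, W, 1, 64, 16) + mults(H, W, 5, 16, 64)  # 1x1 reduce to 16, then 5x5
print(direct, bottleneck)  # 104857600 vs 27262976: roughly a 4x reduction
```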

How Does 1x1 Convolution Work?

The mathematical formulation of 1x1 convolution is the same as that of other convolutions, except that the filter size is 1x1. Let’s denote the input image as X with dimensions of HxWxC, where H is the height, W is the width, and C is the number of channels. We can represent the 1x1 filter bank as a weight matrix W (not to be confused with the width) with dimensions of CxF, where F is the number of output channels.

Then, the output image Y can be obtained by the following equation:

Y = σ(W ∗ X + b)

where “∗” denotes convolution, “b” denotes the bias term, and “σ()” denotes the activation function. In practice, the bias and activation function may be omitted or replaced with other forms of normalization or non-linearity.
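A direct translation of this equation into NumPy, with ReLU standing in for σ (the shapes and weight values are arbitrary illustrations):

```python
import numpy as np

def pointwise_conv(x, w, b):
    """Compute Y = sigma(W * X + b) for a 1x1 convolution."""
    y = np.einsum('hwc,cf->hwf', x, w) + b   # per-pixel linear transform plus bias
    return np.maximum(y, 0.0)                # ReLU as the activation sigma

rng = np.random.default_rng(1)
x = rng.standard_normal((4, 4, 8))    # H x W x C input
w = rng.standard_normal((8, 3))       # C x F weight matrix
b = np.zeros(3)                       # one bias per output channel
print(pointwise_conv(x, w, b).shape)  # (4, 4, 3)
```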

Intuitively, 1x1 convolution applies a linear transformation to each pixel of the input image independently, using a learned set of weights. The output at each pixel depends only on the weighted sum of that pixel’s own channel values, not on its spatial neighbors, plus a bias term and an activation function. By stacking multiple 1x1 convolutions and other types of convolutions, we can build deep convolutional neural networks that can learn complex features and patterns in images.

Why is 1x1 Convolution Important?

1x1 convolution has several important applications in deep learning, in addition to dimensionality reduction. One of them is efficient low dimensional embeddings, which can be useful for transfer learning or embedding large datasets into a smaller space for visualization, clustering, or other purposes.

Another application of 1x1 convolution is introducing non-linearity after other types of convolutions, such as 3x3 and 5x5 convolutions. The reason is that convolutions by themselves are linear operations: they only compute linear combinations of pixel values, and stacking them without an activation function in between collapses into a single linear map. By adding a 1x1 convolution followed by an activation function, we get a computationally cheap place to inject non-linearity into the activation maps, allowing the network to learn more complex and diverse features.
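That stacked linear convolutions collapse into one can be verified numerically: without an activation in between, two 1x1 convolutions are exactly equivalent to a single one (random weights for illustration):

```python
import numpy as np

rng = np.random.default_rng(2)
x = rng.standard_normal((8, 8, 16))
w1 = rng.standard_normal((16, 32))
w2 = rng.standard_normal((32, 4))

# Two stacked 1x1 convolutions with no activation in between...
hidden = np.einsum('hwc,cf->hwf', x, w1)
y_stacked = np.einsum('hwf,fg->hwg', hidden, w2)

# ...equal a single 1x1 convolution whose weights are the matrix product,
# so the extra layer adds no expressive power without a non-linearity.
y_single = np.einsum('hwc,cg->hwg', x, w1 @ w2)
print(np.allclose(y_stacked, y_single))  # True
```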

Furthermore, 1x1 convolution can be viewed as an MLP (multi-layer perceptron) looking at a particular pixel location. An MLP is a type of feedforward neural network that consists of multiple layers of perceptrons, each of which applies a linear transformation followed by a non-linear activation function. By replacing the fully connected layers of an MLP with 1x1 convolutions, we can reduce the number of parameters and computations needed, while still retaining the ability to learn non-linear and spatially invariant features.
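This equivalence is easy to demonstrate: applying one dense (perceptron) layer independently at every pixel location gives the same result as a 1x1 convolution with the same weights (shapes and values below are illustrative):

```python
import numpy as np

rng = np.random.default_rng(3)
x = rng.standard_normal((8, 8, 16))  # H x W x C feature map
w = rng.standard_normal((16, 4))     # one dense layer: 16 inputs -> 4 outputs
b = rng.standard_normal(4)

# Apply the same dense layer (with ReLU) independently at every pixel.
per_pixel_mlp = np.array([[np.maximum(x[i, j] @ w + b, 0.0)
                           for j in range(8)] for i in range(8)])

# A 1x1 convolution with the same weights computes the identical result.
conv_1x1 = np.maximum(np.einsum('hwc,cf->hwf', x, w) + b, 0.0)
print(np.allclose(per_pixel_mlp, conv_1x1))  # True
```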

In summary, 1x1 convolution is a special type of convolution that enables dimensionality reduction, efficient low dimensional embeddings, and applying non-linearity after other convolutions. It operates on individual pixels of the input image, applying a linear transformation to each pixel independently across its channels. By stacking multiple 1x1 convolutions and other types of convolutions, we can build deep convolutional neural networks that can learn complex features and patterns in images. 1x1 convolution is an essential building block in modern computer vision and deep learning, and understanding its properties and uses can help us design more effective and efficient models.
