DiCE Unit

DiCE Units are image model blocks that use dimension-wise convolutions and dimension-wise fusion to encode the spatial and channel-wise information in an input tensor efficiently. Dimension-wise convolutions apply lightweight filters along each dimension of the input tensor, and dimension-wise fusion combines the results, so the unit avoids the computational cost of a standard convolution.

Improving Convolutional Efficiency

Standard convolutions encode spatial and channel-wise information simultaneously. This makes them expensive: the cost of a layer grows with the product of the kernel area, the number of input channels, and the number of output channels, which is often too much for resource-constrained settings. Separable convolutions were introduced to reduce this cost by encoding the two kinds of information separately, with a depth-wise convolution for spatial information and a point-wise (1 x 1) convolution for channel-wise information. However, the point-wise convolution then accounts for most of the computation and becomes the bottleneck.
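
As a rough illustration of this bottleneck, the multiply-accumulate counts of the two approaches can be tallied in a few lines of Python. This is a minimal sketch, assuming stride 1 and padding that preserves the H x W output size; the function names and the example layer size are illustrative assumptions and are not taken from any library.

```python
def standard_conv_macs(h, w, c_in, c_out, k):
    """Multiply-accumulates for a standard k x k convolution (stride 1, same padding assumed)."""
    return h * w * c_in * c_out * k * k


def separable_conv_macs(h, w, c_in, c_out, k):
    """Return (depth-wise, point-wise) multiply-accumulates for a depth-wise separable convolution."""
    depthwise = h * w * c_in * k * k   # one k x k filter per input channel
    pointwise = h * w * c_in * c_out   # 1 x 1 convolution that mixes channels
    return depthwise, pointwise


# Illustrative layer size (an assumption, not from the text).
h, w, c_in, c_out, k = 56, 56, 128, 128, 3

standard = standard_conv_macs(h, w, c_in, c_out, k)
depthwise, pointwise = separable_conv_macs(h, w, c_in, c_out, k)
pointwise_share = pointwise / (depthwise + pointwise)

print(f"standard:   {standard:,}")
print(f"depth-wise: {depthwise:,}")
print(f"point-wise: {pointwise:,} ({pointwise_share:.0%} of the separable total)")
```

Under these assumptions the point-wise share simplifies to c_out / (c_out + k^2), which is 128 / 137, or roughly 93%, for the layer above. The separable convolution is much cheaper than the standard one, but almost all of its remaining cost sits in the point-wise step.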

Dimension-wise Convolutions and Fusion

DiCE Units offer an alternative that addresses both the computational expense of standard convolutions and the point-wise bottleneck of separable convolutions. They use dimension-wise convolutions, which independently encode depth-wise, width-wise, and height-wise information. These convolutions extract local information along each dimension of the input tensor, but they do not capture global information. To address this, DiCE Units employ dimension-wise fusion, which factorizes the point-wise convolution into two steps: local fusion and global fusion.
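
The three dimension-wise convolutions and the two fusion steps can be sketched in NumPy as follows. This is a minimal illustration under stated assumptions, not a reference implementation: the input is a single (D, H, W) tensor, kernels are square with odd size and zero padding, local fusion is modeled as a per-channel weighted sum of the three responses, and global fusion is modeled as a simple sigmoid channel gate. The text above does not specify either fusion step in detail, so both forms, along with all function and parameter names, are assumptions.

```python
import numpy as np


def per_slice_conv(x, kernels):
    """Filter each 2-D slice x[i] with its own k x k kernel (zero padding, stride 1)."""
    n, a, b = x.shape
    k = kernels.shape[-1]                       # kernel size, assumed odd
    p = k // 2
    xp = np.pad(x, ((0, 0), (p, p), (p, p)))    # pad only the two filtered axes
    out = np.zeros((n, a, b), dtype=float)
    for i in range(k):
        for j in range(k):
            out += kernels[:, i, j, None, None] * xp[:, i:i + a, j:j + b]
    return out


def dim_wise_conv(x, k_d, k_w, k_h):
    """x: (D, H, W). Returns the three dimension-wise responses, shape (3, D, H, W)."""
    # Depth-wise: one kernel per channel, sliding over the H x W plane.
    y_d = per_slice_conv(x, k_d)
    # Width-wise: one kernel per column index, sliding over the D x H plane.
    y_w = per_slice_conv(x.transpose(2, 0, 1), k_w).transpose(1, 2, 0)
    # Height-wise: one kernel per row index, sliding over the D x W plane.
    y_h = per_slice_conv(x.transpose(1, 0, 2), k_h).transpose(1, 0, 2)
    return np.stack([y_d, y_w, y_h])


def local_fusion(y, w_local):
    """Assumed local fusion: per channel, a weighted sum of its three responses.

    y: (3, D, H, W), w_local: (D, 3). Returns (D, H, W).
    """
    return np.einsum("dt,tdhw->dhw", w_local, y)


def global_fusion(y, w_channel):
    """Assumed global fusion: scale each channel by a gate built from global averages.

    y: (D, H, W), w_channel: (D, D). Returns (D, H, W).
    """
    pooled = y.mean(axis=(1, 2))                          # one summary value per channel
    gate = 1.0 / (1.0 + np.exp(-(w_channel @ pooled)))    # sigmoid, values in (0, 1)
    return y * gate[:, None, None]


# Usage with random weights (shapes only; the values carry no meaning).
rng = np.random.default_rng(0)
D, H, W, k = 4, 5, 6, 3
x = rng.standard_normal((D, H, W))
y = dim_wise_conv(
    x,
    rng.standard_normal((D, k, k)),
    rng.standard_normal((W, k, k)),
    rng.standard_normal((H, k, k)),
)
out = global_fusion(local_fusion(y, rng.standard_normal((D, 3))),
                    rng.standard_normal((D, D)))
print(y.shape, out.shape)   # (3, 4, 5, 6) (4, 5, 6)
```

In this sketch, local fusion mixes only 3 values per channel rather than all D channels at once, which is where the saving over a full point-wise convolution comes from. The global step then reintroduces information that spans the whole tensor.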

The Benefits of DiCE Units

DiCE Units have two main benefits. First, they are lighter than standard convolutions, and they are more computationally efficient than separable convolutions because they do not concentrate the computation in a point-wise convolution. Second, they capture both local and global information: dimension-wise convolutions extract local information along each dimension, and dimension-wise fusion combines it globally.

Overall, DiCE Units encode spatial and channel-wise information at a lower computational cost than standard or separable convolutions while still representing both the local and the global content of the input tensor.