StyleSwin: Transformer-based GAN for High-resolution Image Generation

In recent years, interest in generative models has surged, particularly for high-resolution image synthesis. Convolutional neural networks (ConvNets) have long dominated image generation tasks with remarkable success, whereas Transformers, a class of neural networks originally designed for natural language processing, have yet to demonstrate their full potential in high-resolution image generative modeling. In this paper, we propose StyleSwin, a method that explores the use of pure transformers in a generative adversarial network (GAN) for high-resolution image synthesis.

Methodology

The proposed StyleSwin generator adopts the Swin transformer in a style-based architecture. We hypothesize that local attention is key to striking a balance between computational efficiency and modeling capacity. To enlarge the receptive field, we propose double attention, which simultaneously leverages the context of the local and the shifted windows, leading to improved generation quality. Furthermore, we find that restoring knowledge of the absolute position, which is lost in window-based transformers, greatly benefits generation quality. One challenge we faced was blocking artifacts during high-resolution synthesis: because local attention is computed in a block-wise manner, spatial coherency can break at window boundaries. After investigating various solutions, we found that a wavelet discriminator that examines the spectral discrepancy effectively suppresses these artifacts.
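The double attention described above might be sketched as follows. This is a minimal single-head NumPy toy, not the paper's implementation: the real model uses learned query/key/value projections and splits attention *heads* between the two window configurations, whereas here we simply split raw channels and use identity projections. The function names (`window_partition`, `double_attention`, etc.) are illustrative.

```python
import numpy as np

def window_partition(x, w):
    # (H, W, C) -> (num_windows, w*w, C): group pixels into w x w windows
    H, W, C = x.shape
    x = x.reshape(H // w, w, W // w, w, C)
    return x.transpose(0, 2, 1, 3, 4).reshape(-1, w * w, C)

def window_reverse(wins, w, H, W):
    # inverse of window_partition: (num_windows, w*w, C) -> (H, W, C)
    C = wins.shape[-1]
    x = wins.reshape(H // w, W // w, w, w, C)
    return x.transpose(0, 2, 1, 3, 4).reshape(H, W, C)

def attention(x):
    # plain scaled dot-product self-attention within each window
    # (real StyleSwin applies learned Q/K/V projections first)
    d = x.shape[-1]
    scores = x @ x.transpose(0, 2, 1) / np.sqrt(d)
    scores -= scores.max(axis=-1, keepdims=True)   # numerical stability
    p = np.exp(scores)
    p /= p.sum(axis=-1, keepdims=True)
    return p @ x

def double_attention(x, w):
    # Half of the channels attend within regular windows, the other half
    # within windows cyclically shifted by w // 2 (as in Swin), so every
    # token sees both local contexts in a single layer.
    H, W, C = x.shape
    half = C // 2
    out1 = window_reverse(attention(window_partition(x[..., :half], w)), w, H, W)
    s = w // 2
    xs = np.roll(x[..., half:], (-s, -s), axis=(0, 1))
    out2 = window_reverse(attention(window_partition(xs, w)), w, H, W)
    out2 = np.roll(out2, (s, s), axis=(0, 1))
    return np.concatenate([out1, out2], axis=-1)

x = np.random.randn(8, 8, 4)
y = double_attention(x, w=4)
print(y.shape)  # (8, 8, 4)
```

Concatenating the two halves doubles the effective receptive field per block at roughly the cost of one windowed attention, which is the efficiency/capacity trade-off the paper argues for.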

Results

Extensive experiments were conducted on high-resolution images, at resolutions up to 1024x1024. StyleSwin greatly outperforms prior transformer-based GANs, especially at high resolutions. Without any complex training strategies, StyleSwin surpasses StyleGAN on CelebA-HQ 1024x1024 and achieves on-par performance on FFHQ 1024x1024. These promising results indicate that transformers can be used for high-resolution image generation.

Conclusion

In summary, we proposed StyleSwin, a method that combines the strengths of transformers and GANs for high-resolution image synthesis. The generator adopts the Swin transformer in a style-based architecture and uses double attention to simultaneously leverage the context of the local and the shifted windows, improving generation quality. A wavelet discriminator further suppresses the blocking artifacts that occur during high-resolution synthesis. StyleSwin has proven its superiority over prior transformer-based GANs, establishing the potential of transformers for high-resolution image generation. Future work will test StyleSwin on more diverse datasets and explore further techniques to improve its performance.
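To see why a wavelet discriminator helps against blocking artifacts, consider a toy one-level Haar transform (a simplification; the paper's discriminator feeds wavelet subbands into a learned classifier rather than thresholding them directly). A sharp block boundary leaves energy in the high-frequency subbands that a smooth region does not, so a discriminator examining those subbands can penalize the artifact:

```python
import numpy as np

def haar_dwt2(x):
    # one-level 2-D Haar transform: returns LL, LH, HL, HH subbands
    a = (x[0::2, :] + x[1::2, :]) / 2   # vertical average
    d = (x[0::2, :] - x[1::2, :]) / 2   # vertical detail
    ll = (a[:, 0::2] + a[:, 1::2]) / 2
    lh = (a[:, 0::2] - a[:, 1::2]) / 2  # horizontal detail of averages
    hl = (d[:, 0::2] + d[:, 1::2]) / 2
    hh = (d[:, 0::2] - d[:, 1::2]) / 2
    return ll, lh, hl, hh

flat = np.ones((8, 8))                  # smooth image: no high-freq energy
blocky = np.ones((8, 8))
blocky[:, 3:] = 0.0                     # sharp vertical block boundary

_, lh_f, _, _ = haar_dwt2(flat)
_, lh_b, _, _ = haar_dwt2(blocky)
print(np.abs(lh_f).sum())  # 0.0  -- smooth image
print(np.abs(lh_b).sum())  # 2.0  -- boundary shows up in the LH subband
```

The spectral discrepancy between real photographs and block-wise generated images is exactly what the wavelet discriminator is trained to detect.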
