Phase Shuffle

Phase shuffle is a technique used in audio generation models to remove pitched noise artifacts, which commonly arise when transposed convolutions are used for upsampling. It randomly shifts the phase of each layer's activations by between -n and n samples before they are passed to the next layer.

What is Phase Shuffle?

Phase shuffle is an operation with a single hyperparameter, n. Wherever it is applied, a layer’s activations are shifted along the time axis by a random whole number of samples between -n and n before they are passed to the following layer. Its main objective is to remove pitched noise artifacts, which occur frequently when transposed convolutions are used.

Why is Phase Shuffle Necessary?

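Transposed convolutions, which audio generation models use for upsampling, can introduce pitched noise artifacts into the synthesized audio. These are the audio counterpart of the checkerboard artifacts seen in images generated the same way. Because the artifacts always occur at a particular phase, a discriminator can learn to reject generated audio simply by detecting them, which hinders training. Phase shuffle addresses this by shifting each layer’s activations by a small random number of samples in either direction, so that the absolute phase of a signal is no longer a reliable cue. The intended result is generated audio with fewer pitched noise artifacts.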
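To illustrate why transposed convolutions can produce pitched artifacts, here is a minimal NumPy sketch. The kernel values, stride, and input length are hypothetical and chosen only for illustration, and `transposed_conv1d` is a naive helper written for this example, not a library function.

```python
import numpy as np

def transposed_conv1d(x, kernel, stride):
    """Naive 1-D transposed convolution (single channel, no padding)."""
    out = np.zeros((len(x) - 1) * stride + len(kernel))
    for i, value in enumerate(x):
        # Each input sample stamps a scaled copy of the kernel into the output.
        out[i * stride : i * stride + len(kernel)] += value * kernel
    return out

stride = 4
# Hypothetical weights; the kernel length is deliberately not a multiple of the stride,
# so neighbouring kernel copies overlap unevenly.
kernel = np.array([1.0, 0.5, 0.25, 0.5, 1.0, 0.5])
x = np.ones(64)  # constant input, which carries no pitch of its own

y = transposed_conv1d(x, kernel, stride)
interior = y[8:-8]  # ignore edge effects

# The output is not constant: it repeats with a period of `stride` samples.
is_periodic = np.allclose(interior[:-stride], interior[stride:])
print(is_periodic, np.ptp(interior) > 0)
```

Even though the input is constant, the output contains a pattern that repeats every `stride` samples and always starts at the same offset. At an assumed sample rate of 16 kHz, a period of 4 samples would be heard as a 4 kHz tone.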

How Does Phase Shuffle Work?

Phase shuffle works by randomly perturbing the phase of each layer’s activations by -n to n samples before they are input to the next layer. This random perturbation effectively "shuffles" the phase information of the signal, so the discriminator can no longer rely on the exact phase of the waveform. The discriminator is instead forced to focus on other features of the signal, which is intended to improve the overall quality of the generated waveform.

The operation takes place entirely in the time domain. "Phase" here refers to the time alignment of the activations, not to the phase of a Fourier transform. The process can be divided into three steps. First, a whole-number shift between -n and n is drawn at random. Then, the layer’s activations are shifted along the time axis by that many samples. Finally, the samples left missing at one edge are filled in by reflection padding and the excess at the other edge is discarded, so the output has the same length as the input.
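The operation can be sketched in a few lines of NumPy. This is a minimal illustration, not the WaveGAN authors' code: it assumes activations of shape (channels, time), a shift drawn uniformly from the integers -n to n, and reflection padding to fill the gap. The function names `shift_with_reflection` and `phase_shuffle` are chosen for this example.

```python
import numpy as np

def shift_with_reflection(x, shift):
    """Shift x (channels, time) along time by `shift` samples, keeping its length.

    A positive shift delays the signal, a negative shift advances it.
    The gap that opens at one edge is filled by reflection padding.
    """
    time = x.shape[1]
    left, right = max(shift, 0), max(-shift, 0)
    padded = np.pad(x, ((0, 0), (left, right)), mode="reflect")
    # Crop back to the original length, dropping samples pushed past the far edge.
    return padded[:, right : right + time]

def phase_shuffle(x, n, rng):
    """Apply a random shift in [-n, n]; uniform sampling is an assumption here."""
    shift = int(rng.integers(-n, n + 1))  # upper bound is exclusive
    return shift_with_reflection(x, shift)

rng = np.random.default_rng(0)
activations = np.arange(6.0)[None, :]  # one channel, six time steps
print(shift_with_reflection(activations, 2))
print(shift_with_reflection(activations, -2))
print(phase_shuffle(activations, n=2, rng=rng))
```

In this sketch a single shift is shared by all channels of one example. When processing a batch, drawing a separate shift for each example is one reasonable choice.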

Where is Phase Shuffle Used?

Phase shuffle was first introduced in WaveGAN, where the authors applied it only to the discriminator. This was because the latent vector already provides the generator with a mechanism to manipulate the phase of the output waveform. Applying phase shuffle to the discriminator makes its job more challenging by requiring invariance to the phase of the input waveform. As a result, the discriminator cannot reject generated audio on the basis of phase alone, which encourages the generator to produce audio with fewer pitched noise artifacts.
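The following NumPy sketch shows where phase shuffle would sit in a discriminator's forward pass. It is a toy illustration under stated assumptions: the layer sizes, kernel width, stride, and random weights are hypothetical, and placing the shuffle after every activation except the last is this example's reading of the WaveGAN design rather than a reproduction of it. The generator is left untouched.

```python
import numpy as np

def leaky_relu(x, alpha=0.2):
    return np.where(x > 0, x, alpha * x)

def strided_conv1d(x, w, stride):
    """x: (c_in, time), w: (c_out, c_in, k). Strided convolution without padding."""
    c_out, c_in, k = w.shape
    steps = (x.shape[1] - k) // stride + 1
    out = np.empty((c_out, steps))
    for t in range(steps):
        window = x[:, t * stride : t * stride + k]  # shape (c_in, k)
        out[:, t] = np.tensordot(w, window, axes=([1, 2], [0, 1]))
    return out

def phase_shuffle(x, n, rng):
    """Random time shift in [-n, n] with reflection padding (uniform draw assumed)."""
    shift = int(rng.integers(-n, n + 1))
    left, right = max(shift, 0), max(-shift, 0)
    padded = np.pad(x, ((0, 0), (left, right)), mode="reflect")
    return padded[:, right : right + x.shape[1]]

def discriminator_features(x, weights, n, rng, stride=4):
    h = x
    for i, w in enumerate(weights):
        h = leaky_relu(strided_conv1d(h, w, stride))
        if i < len(weights) - 1:  # no shuffle after the final convolution
            h = phase_shuffle(h, n, rng)
    return h

rng = np.random.default_rng(0)
# Hypothetical three-layer stack: 1 -> 4 -> 8 -> 1 channels, kernel width 8.
weights = [
    rng.normal(size=(4, 1, 8)),
    rng.normal(size=(8, 4, 8)),
    rng.normal(size=(1, 8, 8)),
]
audio = rng.normal(size=(1, 256))  # stand-in for a mono waveform
features = discriminator_features(audio, weights, n=2, rng=rng)
print(features.shape)
```

Setting `n=0` disables the shuffle and makes the forward pass deterministic, which is convenient for comparing runs with and without it.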

Phase shuffle is not tied to WaveGAN specifically. In principle, it can be applied in other audio generation models that upsample with transposed convolutions, including other Generative Adversarial Networks (GANs). Whether it helps elsewhere is less clear. The technique was designed around a discriminator's sensitivity to phase in audio, so its benefit in models without a discriminator, such as Variational Autoencoders (VAEs), or in image and video generation, where the analogous problem is checkerboard artifacts, should not be assumed.

In summary, phase shuffle is a technique for improving the quality of audio generated by models that use transposed convolutions. It shifts each layer’s activations by a random number of samples, which prevents the discriminator from relying on the phase of the input waveform. The aim is more realistic output with fewer pitched noise artifacts. Although it was introduced in WaveGAN, it may also be useful in other audio generation models.