Multiple Random Window Discriminator

Introduction to Multiple Random Window Discriminator in GAN-TTS

The Multiple Random Window Discriminator (MRWD) is the discriminator of the GAN-TTS text-to-speech architecture. It is an ensemble of Random Window Discriminators (RWDs), each of which operates on a randomly sub-sampled fragment of a real or generated sample rather than on the full waveform; this has a data augmentation effect and reduces computational complexity. The ensemble is formed by taking the Cartesian product of two parameter spaces, the size of the random window and whether the discriminator is conditioned on linguistic and pitch features, which yields ten discriminators that evaluate audio in complementary ways. Each RWD reshapes its input raw waveform to a constant temporal dimension. The conditional discriminators have access to linguistic and pitch features and can therefore measure whether the generated audio matches the input conditioning.
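As a rough illustration of how a Cartesian product of two parameter spaces can give ten discriminators, the sketch below pairs five window sizes with the two conditioning options. The specific window sizes and the names `WINDOW_SIZES`, `CONDITIONING`, and `build_ensemble_configs` are assumptions made for this example, not details stated in the text above.

```python
from itertools import product

# Assumed window sizes in samples (illustrative values for this sketch).
WINDOW_SIZES = (240, 480, 960, 1920, 3600)
# Second parameter space: unconditional (False) or conditional (True).
CONDITIONING = (False, True)


def build_ensemble_configs(window_sizes=WINDOW_SIZES, conditioning=CONDITIONING):
    """Return one config dict per (window size, conditioning) pair."""
    return [
        {"window_size": size, "conditional": cond}
        for size, cond in product(window_sizes, conditioning)
    ]


configs = build_ensemble_configs()
print(len(configs))  # 5 window sizes x 2 conditioning options = 10
```

Each entry describes one RWD; an actual implementation would build a network from every config rather than keep plain dictionaries.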

Why Use Multiple Random Window Discriminator

The MRWD provides the adversarial feedback used to train the GAN-TTS generator. Feeding each discriminator random windows of different sizes instead of full samples has a data augmentation effect, since many distinct windows can be drawn from a single utterance, and it reduces computational complexity. Because each RWD reshapes its window to the same temporal dimension, all the RWDs have the same architecture and similar computational complexity despite different window sizes. Conditional discriminators have access to linguistic and pitch features and measure whether the generated audio matches the input conditioning. Unconditional discriminators, on the other hand, evaluate the realism of the audio regardless of conditioning. Together, the two kinds of discriminator and the range of window sizes allow audio to be evaluated in different complementary ways.
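The random sub-sampling can be sketched in a few lines. This is a minimal example under stated assumptions: the waveform is a plain Python list, and `sample_random_window` and its `hop` parameter are hypothetical names introduced here. The `hop` argument is one possible way to keep window starts on conditioning-frame boundaries for conditional discriminators; it is an assumption of this sketch, not something the text above specifies.

```python
import random


def sample_random_window(waveform, window_size, hop=1, rng=random):
    """Pick a random contiguous window of `window_size` samples.

    `hop` (hypothetical) restricts start positions to multiples of `hop`,
    e.g. the number of audio samples per conditioning frame.
    """
    if window_size > len(waveform):
        raise ValueError("window_size exceeds waveform length")
    # Number of valid start positions on the hop grid.
    num_starts = (len(waveform) - window_size) // hop + 1
    start = hop * rng.randrange(num_starts)
    return start, waveform[start:start + window_size]


# Usage: draw a 240-sample window from a toy 1000-sample signal.
rng = random.Random(0)
toy_waveform = list(range(1000))
start, window = sample_random_window(toy_waveform, 240, hop=120, rng=rng)
print(len(window))  # 240
```

Drawing a fresh window at every training step is what produces the augmentation effect: the same utterance yields many different discriminator inputs.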

How MRWD Works

In the first layer of each discriminator, the input raw waveform is reshaped to a constant temporal dimension by moving consecutive blocks of samples into the channel dimension, so that a longer window becomes a sequence of the same length with more channels. In conditional RWDs, the input waveform is gradually downsampled by DBlocks until the temporal dimension of the activation equals that of the conditioning, at which point a conditional DBlock combines the two and passes the joint information to the remaining DBlocks.
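The reshaping step can be illustrated without any deep learning library. In the sketch below, a window of `k * target_len` samples is folded into `target_len` time steps with `k` channels each. The function name `fold_into_channels` and the default `target_len=240` are assumptions chosen for illustration.

```python
def fold_into_channels(window, target_len=240):
    """Reshape a 1-D window into `target_len` steps of k channels each.

    Consecutive blocks of k samples become the channels of one time step,
    so every window size ends up with the same temporal dimension.
    """
    if len(window) % target_len != 0:
        raise ValueError("window length must be a multiple of target_len")
    k = len(window) // target_len  # samples moved into the channel dimension
    return [window[i * k:(i + 1) * k] for i in range(target_len)]


# Usage: a 480-sample window becomes 240 steps with 2 channels.
folded = fold_into_channels(list(range(480)))
print(len(folded), len(folded[0]))  # 240 2
```

With this layout, discriminators for different window sizes differ only in the number of input channels of their first layer, which is consistent with the RWDs sharing one architecture.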

The discriminators are built from DBlocks, which are similar to the GBlocks used in the generator but do not use batch normalization. The dilation factors in the DBlocks' convolutions follow the pattern 1, 2, 1, 2. Unlike the generator, the discriminator operates on a relatively small window, and the authors did not observe any benefit from using larger dilation factors. The output of the last DBlock is average-pooled to obtain a scalar.
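To get a feel for why small dilation factors may be enough on a short window, the following sketch computes the receptive field of a stack of dilated convolutions. It is a simplification under stated assumptions: a kernel size of 3, stride 1, and no downsampling or residual paths, none of which are specified in the text above. The helper `receptive_field` and the larger comparison pattern are hypothetical.

```python
def receptive_field(dilations, kernel_size=3):
    """Receptive field (in time steps) of stacked stride-1 dilated convolutions.

    Each layer widens the field by (kernel_size - 1) * dilation.
    """
    return 1 + sum((kernel_size - 1) * d for d in dilations)


print(receptive_field([1, 2, 1, 2]))  # 13
print(receptive_field([1, 2, 4, 8]))  # 31, a larger pattern for comparison
```

Under these assumptions, the 1, 2, 1, 2 pattern already covers 13 time steps before any downsampling is taken into account, and each downsampling stage in a real discriminator widens the coverage further in terms of input samples.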

In summary, the Multiple Random Window Discriminator is the discriminator ensemble of the GAN-TTS text-to-speech architecture. Using random windows of different sizes instead of full samples has a data augmentation effect and reduces computational complexity. Its conditional discriminators have access to linguistic and pitch features and check whether the audio matches the input conditioning, while its unconditional discriminators judge realism alone, so the audio is evaluated in different complementary ways. All of the discriminators are built from DBlocks, which are similar to GBlocks but omit batch normalization.