GBlock

What is GBlock?

GBlock is a type of residual block used in the generator (G) of the GAN-TTS text-to-speech architecture. Because the generator produces raw audio, where a 2s training clip corresponds to a sequence of 48000 samples, its receptive field must be large enough to capture long-term dependencies. GBlock uses dilated convolutions to achieve this.

How Does GBlock Work?

A GBlock is a stack of two residual blocks. Each GBlock contains four convolutions with kernel size 3 and increasing dilation factors of 1, 2, 4 and 8. Each convolution is preceded by Conditional Batch Normalization, conditioned on a linear embedding of the noise term z ~ N(0, I_128) in the single-speaker case, or of the concatenation of z and a one-hot representation of the speaker ID in the multi-speaker case. The embeddings are different for each BatchNorm instance.
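To make the dilation pattern concrete, here is a minimal pure-Python sketch of a single-channel, kernel-size-3 dilated convolution applied four times with dilations 1, 2, 4 and 8. The function name `dilated_conv1d`, the zero padding and the toy weights are assumptions made for illustration. The real block uses learned multi-channel convolutions with normalization between them, which this sketch leaves out.

```python
def dilated_conv1d(x, weights, dilation):
    """Single-channel 1-D convolution, kernel size 3, 'same' zero padding."""
    assert len(weights) == 3
    n = len(x)
    out = []
    for t in range(n):
        acc = 0.0
        # The three taps sit at t - dilation, t and t + dilation.
        for k, w in enumerate(weights):
            idx = t + (k - 1) * dilation
            if 0 <= idx < n:  # positions outside the signal count as zero
                acc += w * x[idx]
        out.append(acc)
    return out


DILATIONS = (1, 2, 4, 8)  # dilation factors listed for a GBlock

# Toy input and toy (not learned) smoothing weights.
signal = [float(i) for i in range(16)]
out = signal
for d in DILATIONS:
    out = dilated_conv1d(out, (0.25, 0.5, 0.25), d)
```

Because the padding keeps the length unchanged, the four convolutions can be stacked without shrinking the sequence; only the spacing between taps changes from one layer to the next.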

Increasing the dilation factor from one convolution to the next makes the receptive field grow much faster than it would with undilated convolutions of the same kernel size, without adding parameters. This matters because the generator produces raw audio, so it must capture long-term dependencies across very long sample sequences.
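The effect of dilation on the receptive field can be checked with a short calculation. For stacked stride-1 convolutions, each layer adds (kernel size - 1) x dilation samples to the receptive field. The helper name `receptive_field` is an assumption for this sketch, and the figures are measured at the block's own time resolution, ignoring any upsampling.

```python
def receptive_field(kernel_size, dilations):
    """Receptive field, in samples, of stacked stride-1 dilated convolutions."""
    return 1 + sum((kernel_size - 1) * d for d in dilations)


# Four kernel-size-3 convolutions with the GBlock dilation factors.
gblock_rf = receptive_field(3, [1, 2, 4, 8])  # 1 + 2 * (1 + 2 + 4 + 8) = 31

# The same four convolutions without dilation, for comparison.
plain_rf = receptive_field(3, [1, 1, 1, 1])   # 1 + 2 * 4 = 9
```

With the same number of weights, the dilated stack sees 31 samples where the undilated one sees 9.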

The Skip Connections in a GBlock

A GBlock contains two skip connections, one per residual block. The first performs upsampling if the output frequency is higher than the input frequency, and it also contains a size-1 convolution if the number of output channels differs from the number of input channels. As in other residual architectures, the skip connections let each block learn a correction to its input and give gradients a short path through the network, which makes deep stacks easier to train.
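The shortcut logic described above can be sketched in a few lines of pure Python. This is an illustration under stated assumptions, not the GAN-TTS implementation: the function names are hypothetical, signals are stored as a list of channels, and nearest-neighbour repetition is assumed as the upsampling method because the text does not specify one.

```python
def upsample_nearest(x, factor):
    """Repeat every sample `factor` times along time. x is a list of channels."""
    return [[v for v in channel for _ in range(factor)] for channel in x]


def conv1x1(x, weight):
    """Size-1 convolution: mix channels independently at each time step.

    weight[o][i] maps input channel i to output channel o.
    """
    n_steps = len(x[0])
    return [
        [sum(w * x[i][t] for i, w in enumerate(row)) for t in range(n_steps)]
        for row in weight
    ]


def skip_path(x, out_channels, upsample_factor, weight=None):
    """Shortcut branch: upsample if needed, project channels if needed."""
    if upsample_factor > 1:
        x = upsample_nearest(x, upsample_factor)
    if out_channels != len(x):
        if weight is None:
            raise ValueError("a projection weight is needed when channels change")
        x = conv1x1(x, weight)
    return x


def residual_add(main, skip):
    """Element-wise sum of the main branch and the shortcut branch."""
    return [[a + b for a, b in zip(mc, sc)] for mc, sc in zip(main, skip)]


# Two input channels, two time steps; project to one channel at twice the rate.
x = [[1.0, 2.0], [3.0, 4.0]]
shortcut = skip_path(x, out_channels=1, upsample_factor=2, weight=[[1.0, 1.0]])
```

When neither the rate nor the channel count changes, `skip_path` returns its input untouched, which is the ordinary identity shortcut of a residual block.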

Conditional Batch Normalization

Conditional Batch Normalization is applied before each convolution in a GBlock. Like ordinary batch normalization, it normalizes activations using batch statistics, but its scale and shift are predicted from a conditioning input instead of being fixed learned parameters. In GAN-TTS the condition is the noise term z ~ N(0, I_128) in the single-speaker case, or the concatenation of z and a one-hot representation of the speaker ID in the multi-speaker case. The condition is passed through a linear embedding, and each BatchNorm instance has its own embedding, so every normalization layer can respond to the condition differently.
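A minimal pure-Python sketch of the idea follows. It assumes the common formulation in which a linear map of the condition yields a per-channel scale (gamma) and shift (beta). All names, the tensor layout ([batch][channel][time]) and the tiny 2-dimensional condition are assumptions for illustration; GAN-TTS uses a 128-dimensional z, and its exact parameterization may differ.

```python
import math


def make_condition(z, speaker_id=None, n_speakers=0):
    """Single-speaker: z alone. Multi-speaker: z plus a one-hot speaker ID."""
    if speaker_id is None:
        return list(z)
    one_hot = [1.0 if i == speaker_id else 0.0 for i in range(n_speakers)]
    return list(z) + one_hot


def linear(v, weight, bias):
    """Plain linear map; weight has one row per output value."""
    return [sum(w * a for w, a in zip(row, v)) + b for row, b in zip(weight, bias)]


def conditional_batch_norm(x, cond, w_gamma, b_gamma, w_beta, b_beta, eps=1e-5):
    """x: [batch][channel][time]; cond: [batch][cond_dim]. Same shape out as x."""
    n_batch, n_ch, n_time = len(x), len(x[0]), len(x[0][0])
    # Per-example scale and shift, predicted from the condition.
    gammas = [linear(c, w_gamma, b_gamma) for c in cond]
    betas = [linear(c, w_beta, b_beta) for c in cond]
    out = [[None] * n_ch for _ in range(n_batch)]
    for c in range(n_ch):
        # Batch statistics for this channel, pooled over batch and time.
        values = [x[b][c][t] for b in range(n_batch) for t in range(n_time)]
        mean = sum(values) / len(values)
        var = sum((v - mean) ** 2 for v in values) / len(values)
        inv_std = 1.0 / math.sqrt(var + eps)
        for b in range(n_batch):
            out[b][c] = [
                gammas[b][c] * (v - mean) * inv_std + betas[b][c] for v in x[b][c]
            ]
    return out


# Toy batch: 2 examples, 1 channel, 3 time steps, 2-dimensional condition.
x = [[[1.0, 2.0, 3.0]], [[4.0, 5.0, 6.0]]]
cond = [make_condition([0.5, -0.5]), make_condition([1.0, 2.0])]
w_gamma, b_gamma = [[0.0, 0.0]], [1.0]  # gamma fixed at 1 for this demo
w_beta, b_beta = [[1.0, 0.0]], [0.0]    # beta equals the first entry of z
y = conditional_batch_norm(x, cond, w_gamma, b_gamma, w_beta, b_beta)
```

Giving each normalization layer its own `w_gamma`, `b_gamma`, `w_beta` and `b_beta` corresponds to the statement that the embeddings differ for each BatchNorm instance.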

Applications of GBlock

GBlock is the building block of the GAN-TTS generator, which is used for text-to-speech synthesis. Synthetic voices of this kind are relevant to products such as the virtual assistants Siri, Alexa and Cortana. The ingredients of GBlock (dilated convolutions, conditional normalization and residual connections) are also common in other models that process long sequences, although GBlock itself was designed for raw-audio generation.

In summary, GBlock is the residual block used in the GAN-TTS generator. Its dilated convolutions give the generator a receptive field large enough to capture long-term dependencies in raw audio. Conditional Batch Normalization injects the noise term, and in the multi-speaker case the speaker ID, before every convolution. The skip connections handle upsampling and changes in channel count while keeping the block residual.