FastSpeech 2s

FastSpeech 2s is a text-to-speech model that generates the speech waveform directly from text at inference time. It does not produce a mel-spectrogram as an intermediate output, so no separate vocoder stage is needed to turn spectrograms into audio. To make direct waveform generation work, FastSpeech 2s introduces two main design changes in the waveform decoder.

Main Design Changes

The first change is adversarial training. Phase information is difficult to predict with a variance predictor, so adversarial training is applied to the waveform decoder, forcing it to recover phase information implicitly by itself. The discriminator's feedback gives the decoder a training signal that pushes its output toward realistic waveforms.
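The adversarial signal can be illustrated with a small numerical sketch. The text above does not say which adversarial loss is used; the example assumes a least-squares GAN objective of the kind used with Parallel WaveGAN, and the function names and discriminator scores are hypothetical.

```python
import numpy as np

def discriminator_loss(d_real, d_fake):
    """Least-squares discriminator loss (assumed objective).

    Pushes scores on real audio toward 1 and scores on generated audio toward 0.
    """
    return float(np.mean((d_real - 1.0) ** 2) + np.mean(d_fake ** 2))

def generator_adversarial_loss(d_fake):
    """Least-squares generator loss (assumed objective).

    Pushes the discriminator's scores on generated audio toward 1.
    """
    return float(np.mean((d_fake - 1.0) ** 2))

# Hypothetical discriminator scores for one short audio clip.
d_real = np.array([0.9, 1.1, 1.0, 0.8])   # scores on recorded audio
d_fake = np.array([0.1, -0.2, 0.3, 0.0])  # scores on decoder output

print("discriminator loss:", discriminator_loss(d_real, d_fake))
print("generator loss:", generator_adversarial_loss(d_fake))
```

The waveform decoder plays the role of the generator here: its adversarial loss falls only when the discriminator can no longer tell its output from recorded audio. Since no explicit phase target exists, this is one way the decoder can be pushed toward plausible phase.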

The second change is to keep the mel-spectrogram decoder of FastSpeech 2 during training. That decoder is trained on the full text sequence and helps with text feature extraction. The waveform decoder, by contrast, is trained on a short audio clip: it takes the portion of the hidden sequence corresponding to that clip and upsamples it to the clip's length. Because the mel-spectrogram decoder is needed only during training, it can be dropped at inference, which keeps the deployed model compact.
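Working on a short audio clip means pairing a window of the hidden sequence with the audio samples it covers. The sketch below shows one way to cut such an aligned pair. The hop size of 256 samples per hidden frame, the array shapes, and the helper name `slice_training_pair` are assumptions for illustration, not values taken from the model.

```python
import numpy as np

HOP_SIZE = 256  # assumed number of waveform samples per hidden frame

def slice_training_pair(hidden, waveform, start_frame, clip_frames, hop_size=HOP_SIZE):
    """Return a window of hidden frames and the waveform samples aligned with it."""
    end_frame = start_frame + clip_frames
    if start_frame < 0 or end_frame > hidden.shape[0]:
        raise ValueError("clip window falls outside the utterance")
    hidden_clip = hidden[start_frame:end_frame]
    audio_clip = waveform[start_frame * hop_size : end_frame * hop_size]
    return hidden_clip, audio_clip

rng = np.random.default_rng(0)
num_frames, hidden_dim, clip_frames = 100, 8, 20

# Stand-ins for the frame-level hidden sequence and the recorded waveform.
hidden = rng.standard_normal((num_frames, hidden_dim))
waveform = rng.standard_normal(num_frames * HOP_SIZE)

# Pick a random window that fits inside the utterance.
start = int(rng.integers(0, num_frames - clip_frames + 1))
hidden_clip, audio_clip = slice_training_pair(hidden, waveform, start, clip_frames)

print(hidden_clip.shape, audio_clip.shape)  # (20, 8) (5120,)
```

Only the sliced window reaches the waveform decoder in this sketch, so the memory cost depends on the clip length rather than on the length of the full utterance.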

Waveform Decoder

The waveform decoder of FastSpeech 2s follows the structure of WaveNet, including non-causal convolutions and gated activations. It takes a sliced hidden sequence corresponding to a short audio clip and upsamples it with transposed 1-D convolution to match the length of the audio clip. For adversarial training, the discriminator adopts the same structure as the discriminator in Parallel WaveGAN, which consists of ten layers of non-causal dilated 1-D convolutions with leaky ReLU activations.
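The two building blocks named above, transposed 1-D convolution for upsampling and the WaveNet-style gated activation, can be sketched in a few lines of NumPy. This is a single-channel toy with no learned weights. The upsampling factor of 4, the all-ones kernel, and the function names are assumptions chosen to make the length arithmetic easy to follow.

```python
import numpy as np

def transposed_conv1d(x, kernel, stride):
    """Single-channel transposed 1-D convolution.

    Each input value adds a scaled copy of the kernel to the output,
    with consecutive copies placed `stride` samples apart.
    """
    out = np.zeros((len(x) - 1) * stride + len(kernel))
    for i, value in enumerate(x):
        out[i * stride : i * stride + len(kernel)] += value * kernel
    return out

def gated_activation(filter_out, gate_out):
    """WaveNet-style gated unit: tanh(filter) * sigmoid(gate)."""
    return np.tanh(filter_out) * (1.0 / (1.0 + np.exp(-gate_out)))

frames = np.array([1.0, -2.0, 0.5])  # toy one-channel hidden clip (3 frames)
stride = 4                            # assumed upsampling factor
kernel = np.ones(stride)              # kernel length equals stride, so copies do not overlap

upsampled = transposed_conv1d(frames, kernel, stride)
print(upsampled.shape)  # (12,)

# In a real layer the filter and gate come from two separate convolutions;
# here the same signal is reused for both purely to show the call.
gated = gated_activation(upsampled, upsampled)
print(gated.shape)  # (12,)
```

With a kernel as long as the stride, the output length is exactly the input length times the stride, which is how a frame-level hidden clip can be stretched to the sample-level length of its audio clip.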

Inference

During inference, FastSpeech 2s discards the mel-spectrogram decoder and uses only the waveform decoder to synthesize speech audio. The mel-spectrogram decoder serves as a training-time aid for text feature extraction, so removing it makes the inference model more compact.

In summary, FastSpeech 2s generates waveforms directly from text rather than producing mel-spectrograms first, which makes inference more compact. Its two key design choices are adversarial training of the waveform decoder, which lets the decoder recover phase information implicitly, and the use of the FastSpeech 2 mel-spectrogram decoder during training to aid text feature extraction.