Are you tired of robotic-sounding text-to-speech models? Look no further than FastPitch - a state-of-the-art, fully parallel model based on FastSpeech that produces natural-sounding speech by conditioning on fundamental frequency (pitch) contours.
What is FastPitch?
FastPitch is a text-to-speech model that builds on the FastSpeech architecture, using two feed-forward Transformer (FFTr) stacks to produce high-quality, natural-sounding speech. Unlike autoregressive text-to-speech models, which generate output frames one at a time, FastPitch is fully parallel: it generates all mel-spectrogram frames in a single pass, which makes synthesis faster.
How Does FastPitch Work?
FastPitch is composed of two FFTr stacks that operate at different resolutions - the first stack operates at the resolution of the input tokens, while the second stack operates at the resolution of the output mel-scale spectrogram frames.
The first FFTr stack produces a hidden representation $\mathbf{h}$ of the input sequence $\left(x_{1}, \ldots, x_{n}\right)$ using the equation:
$$\mathbf{h}=\operatorname{FFTr}(\mathbf{x})$$
The hidden representation $\mathbf{h}$ is used to predict the duration $\hat{\mathbf{d}}$ and average pitch $\hat{\mathbf{p}}$ of each character using a 1-D CNN. The pitch is then projected to match the dimensionality of $\mathbf{h}$ and added to $\mathbf{h}$, giving the sum $\mathbf{g}$.
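The conditioning step can be sketched in a few lines of plain Python. This is only an illustration, not the reference implementation: the function name `add_pitch_conditioning`, the projection weights `w` and bias `b`, and the toy values are all assumptions made for the example, and a simple per-dimension linear map stands in for whatever projection layer the model actually uses.

```python
def add_pitch_conditioning(h, pitch, w, b):
    """Return g, where g[i] = h[i] + (pitch[i] * w + b).

    h     : list of n hidden vectors, each of length dim
    pitch : list of n scalar average-pitch values, one per input token
    w, b  : hypothetical projection weights and bias, each of length dim
    """
    if len(h) != len(pitch):
        raise ValueError("h and pitch must have one entry per input token")
    g = []
    for h_i, p_i in zip(h, pitch):
        # Project the scalar pitch up to the dimensionality of h.
        projected = [p_i * w_k + b_k for w_k, b_k in zip(w, b)]
        # Add the projected pitch to the hidden representation.
        g.append([h_k + q_k for h_k, q_k in zip(h_i, projected)])
    return g


# Toy values, chosen only so the arithmetic is easy to follow.
h = [[1.0, 2.0], [3.0, 4.0]]
pitch = [0.5, -1.0]
w = [2.0, 0.0]
b = [0.0, 1.0]
g = add_pitch_conditioning(h, pitch, w, b)
```

The result keeps one vector per input token; only the values change, so the duration-based upsampling that follows still lines up with the input sequence.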
The resulting sum $\mathbf{g}$ is discretely upsampled, meaning that each vector $g_{i}$ is repeated $d_{i}$ times, and passed to the output FFTr stack, which produces the output mel-spectrogram sequence $\hat{\mathbf{y}}$ using the equation:
$$\hat{\mathbf{y}}=\operatorname{FFTr}\left([\underbrace{g_{1}, \ldots, g_{1}}_{d_{1}}, \ldots, \underbrace{g_{n}, \ldots, g_{n}}_{d_{n}}]\right)$$
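The bracketed sequence inside the equation is just each $g_{i}$ repeated $d_{i}$ times. A minimal sketch in plain Python, assuming integer durations measured in spectrogram frames (the function name `upsample_by_duration` and the placeholder strings are hypothetical, used here only for illustration):

```python
def upsample_by_duration(g, durations):
    """Repeat each token-level vector g[i] exactly durations[i] times."""
    if len(g) != len(durations):
        raise ValueError("need exactly one duration per token")
    frames = []
    for g_i, d_i in zip(g, durations):
        if d_i < 0:
            raise ValueError("durations must be non-negative integers")
        # A duration of 0 drops the token from the frame sequence.
        frames.extend([g_i] * d_i)
    return frames


# Strings stand in for the vectors g_1, g_2, g_3.
g = ["g1", "g2", "g3"]
durations = [2, 0, 3]
frames = upsample_by_duration(g, durations)
```

The length of the upsampled sequence equals the sum of the durations, which is how the second stack ends up working at the resolution of the output frames rather than the input tokens.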
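During training, FastPitch conditions on the ground-truth pitch $\mathbf{p}$ and upsamples with the ground-truth durations $\mathbf{d}$. The model is optimized with a mean-squared error (MSE) loss between the predicted and ground-truth modalities, where $\alpha$ and $\gamma$ weight the pitch and duration terms: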
$$\mathcal{L}=\|\hat{\mathbf{y}}-\mathbf{y}\|_{2}^{2}+\alpha\|\hat{\mathbf{p}}-\mathbf{p}\|_{2}^{2}+\gamma\|\hat{\mathbf{d}}-\mathbf{d}\|_{2}^{2}$$
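To make the three terms concrete, here is a small worked example in plain Python. It follows the formula as written, using squared L2 norms (sums of squared differences, without dividing by the number of elements). The function names, the toy values, and the weights `alpha=0.5` and `gamma=0.25` are assumptions picked for easy arithmetic, not values from the model, and the spectrogram is flattened to a list for simplicity.

```python
def squared_l2(pred, target):
    """Squared L2 norm of the difference between two equal-length lists."""
    if len(pred) != len(target):
        raise ValueError("pred and target must have the same length")
    return sum((a - b) ** 2 for a, b in zip(pred, target))


def fastpitch_loss(y_hat, y, p_hat, p, d_hat, d, alpha, gamma):
    """Spectrogram term plus weighted pitch and duration terms."""
    return (
        squared_l2(y_hat, y)
        + alpha * squared_l2(p_hat, p)
        + gamma * squared_l2(d_hat, d)
    )


# Toy values: spectrogram term = 4.0, pitch term = 4.0, duration term = 1.0.
loss = fastpitch_loss(
    y_hat=[1.0, 2.0], y=[1.0, 0.0],
    p_hat=[3.0], p=[1.0],
    d_hat=[2.0, 1.0], d=[1.0, 1.0],
    alpha=0.5, gamma=0.25,
)
# loss = 4.0 + 0.5 * 4.0 + 0.25 * 1.0 = 6.25
```

Raising `alpha` or `gamma` makes errors in the pitch or duration predictions count for more relative to errors in the spectrogram.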
During inference, no ground truth is available, so FastPitch conditions on the predicted $\hat{\mathbf{p}}$ and upsamples with the predicted $\hat{\mathbf{d}}$ to generate natural-sounding speech.
What are the Advantages of FastPitch?
FastPitch has several advantages over other text-to-speech models:
- Natural-sounding speech: FastPitch produces high-quality, natural-sounding speech that closely resembles human speech.
- Fully parallel architecture: FastPitch generates all spectrogram frames in parallel rather than one at a time, which speeds up synthesis.
- Conditioned on fundamental frequency contours: Because FastPitch is conditioned on fundamental frequency contours, the predicted pitch can be adjusted at inference time, which allows for greater control over the prosody of the generated speech.
- Easy to use: FastPitch can be easily integrated into existing applications and does not require extensive knowledge of machine learning.
Applications of FastPitch
FastPitch has several applications in various industries:
- Voice assistants: FastPitch can be used to produce natural-sounding speech for voice assistants like Amazon Alexa and Google Assistant.
- Accessibility: FastPitch can be used to produce natural-sounding speech for individuals with speech disabilities.
- Voice-over industry: FastPitch can be used to produce natural-sounding voice-overs for movies, TV shows, and commercials.
FastPitch is a state-of-the-art text-to-speech model that produces high-quality, natural-sounding speech. Its fully-parallel architecture, conditioning on fundamental frequency contours, and ease of use make it an attractive option for various industries that require natural-sounding speech production. Whether you're a voice assistant developer, a film producer, or an individual with speech disabilities, FastPitch has the potential to improve your life and work. Try it out today!