Amazon Polly

Amazon Polly is a cloud-based text-to-speech service developed by Amazon Web Services that lets users create high-quality, lifelike speech in over 30 languages, using machine learning to produce natural-sounding output. The service gives users customizable, controllable speech output, including a choice of standard or neural voices and support for Speech Synthesis Markup Language (SSML) tags.

Amazon Polly's metadata streaming feature reports when particular sentences, words, and sounds are being pronounced, which makes it possible to synchronize speech with accompanying graphics or supplementary sound effects. The API is designed to be simple to integrate into applications.

Amazon Polly also offers a customization option called Brand Voice, which lets businesses create their own unique voice for use in their Alexa Skills or Amazon Connect experiences.

TLDR

Amazon Polly is a cloud-based, AI-powered text-to-speech service that lets users create high-quality, natural-sounding speech output in over 30 languages. Its features include standard and neural voices, Speech Synthesis Markup Language (SSML) support, metadata streaming, and customization options for bespoke audio output. The service is billed per character, offers a simple API, and includes tools for specialized business use such as the creation of unique, branded voices.

Alternatives to Amazon Polly in the text-to-speech market include Google Text-to-Speech, Microsoft Azure Text-to-Speech, IBM Watson Text-to-Speech, NaturalReader, and ReadSpeaker.

Company Overview

Amazon Polly is an AI-powered text-to-speech service that enables businesses to deploy high-quality, natural-sounding human voices in dozens of languages. With a wide range of features and customization options, Amazon Polly is an essential tool for businesses that need to incorporate speech into their applications for a global audience. The service is provided by Amazon Web Services, a subsidiary of Amazon.com that provides cloud computing services to businesses of all sizes.

One of the key benefits of Amazon Polly is that it offers 5 million characters free per month for 12 months with the AWS Free Tier. This makes it easy for businesses to get started with the service and experiment with different use cases without incurring any costs. In addition, Amazon Polly lets businesses customize and control speech output through lexicons and Speech Synthesis Markup Language (SSML) tags, so they can tailor the service to their specific needs.

Another advantage of Amazon Polly is that it lets businesses store and redistribute speech in standard formats like MP3 and OGG. This makes it easy to add speech to applications with a global audience, such as RSS feeds, websites, or videos. Stored speech output can also be replayed to prompt callers through interactive or automated voice response systems, which suits businesses that need lifelike voices and conversational user experiences with consistently fast response times.

Amazon Polly supports SSML, a W3C standard XML-based markup language for speech synthesis applications, including common SSML tags for phrasing, emphasis, and intonation. This lets businesses fine-tune speech quality and create more nuanced and engaging speech output.

Overall, Amazon Polly is a powerful and flexible tool for businesses that need to add high-quality, natural-sounding speech to their applications. With a variety of customization options and support for dozens of languages, Amazon Polly is an essential tool for any business that wants to stay ahead of the competition and deliver better experiences for their customers.

Features

Speech Synthesis

Dozens of Lifelike Voices

Amazon Polly provides access to a variety of lifelike voices that can be used to enhance your application's user experience. You can choose from over 30 different voices, including Standard TTS voices and Neural Text-to-Speech (NTTS) voices. These voices are designed to sound natural and human-like, making them ideal for use in a variety of industries such as e-learning, gaming, and more.

Some of the popular Standard TTS voices include Nicole, Emma, and Joanna, while some of the Neural voices include Olivia, Vitória, and Camila.

Support for Multiple Languages

Amazon Polly supports a variety of languages, making it suitable for a global marketplace. Supported languages include English, Spanish, French, German, Italian, Portuguese, Korean, and Japanese, among others.

Speech Synthesis Markup Language (SSML)

Amazon Polly also supports Speech Synthesis Markup Language (SSML), allowing developers to create more sophisticated speech output for their applications. SSML provides a way to mark up text with information about the pronunciation, intonation, and other nuances of the text, giving developers more control over how the text is spoken. This flexibility helps developers create more lifelike speech that is tailored to their specific use cases.
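To make the idea of marking up text concrete, here is a minimal sketch in Python that wraps plain text in an SSML document. The helper name `build_ssml` and its parameters are hypothetical and chosen only for illustration; the `<speak>`, `<prosody>`, and `<break>` elements come from the SSML standard.

```python
from xml.sax.saxutils import escape


def build_ssml(text: str, rate: str = "medium", pause_ms: int = 0) -> str:
    """Wrap plain text in a <speak> document (hypothetical helper).

    rate     -- value for the prosody rate attribute, e.g. "slow" or "fast"
    pause_ms -- optional pause appended after the text, in milliseconds
    """
    # Escape &, < and > so user text cannot break the XML structure.
    body = f'<prosody rate="{rate}">{escape(text)}</prosody>'
    if pause_ms > 0:
        body += f'<break time="{pause_ms}ms"/>'
    return f"<speak>{body}</speak>"


print(build_ssml("Tom & Jerry", rate="slow", pause_ms=500))
```

The resulting string would be sent as the text of a synthesis request with the text type set to SSML rather than plain text.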

Metadata Streaming

Speech Mark Integration

Amazon Polly makes it easy to request an additional stream of metadata that provides information about when particular sentences, words, and sounds are being pronounced. This metadata stream can be combined with the synthesized speech audio stream to build enhanced visual experiences such as speech-synchronized facial animation or karaoke-style word highlighting.

The metadata is streamed in near real-time, allowing for a more responsive and interactive experience.
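As a rough illustration of karaoke-style highlighting, the sketch below parses a speech mark stream and finds the word being spoken at a given playback time. It assumes the metadata arrives as one JSON object per line with `time` (milliseconds), `type`, and `value` fields; the sample data is made up for the example and is not real service output, and the function names are hypothetical.

```python
import json
from typing import Dict, List, Optional

# Illustrative sample only: one JSON object per line, times in milliseconds.
SAMPLE_MARKS = "\n".join([
    '{"time":0,"type":"sentence","start":0,"end":23,"value":"Mary had a little lamb."}',
    '{"time":6,"type":"word","start":0,"end":4,"value":"Mary"}',
    '{"time":373,"type":"word","start":5,"end":8,"value":"had"}',
])


def parse_speech_marks(stream_text: str) -> List[Dict]:
    """Turn a newline-delimited JSON stream into a list of dicts."""
    return [json.loads(line) for line in stream_text.splitlines() if line.strip()]


def word_at(marks: List[Dict], playback_ms: int) -> Optional[Dict]:
    """Return the latest word mark that started at or before playback_ms."""
    current = None
    for mark in marks:
        if mark["type"] == "word" and mark["time"] <= playback_ms:
            current = mark
    return current


marks = parse_speech_marks(SAMPLE_MARKS)
print(word_at(marks, 400)["value"])
```

A player would call `word_at` on each timer tick with the current audio position and highlight the returned word.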

Time-Driven Prosody

With a feature called Time-Driven Prosody, Amazon Polly can automatically adjust the speech rate so that the audio fits within a maximum allotted amount of time. This is useful in many use cases, especially localization.

For example, suppose you have US English speech embedded in your training video and want to localize this video into German, where the German speech cannot be longer than the US English speech. You can use this feature to keep the German audio within that time limit, which simplifies the dubbing process.
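The dubbing scenario can be sketched as a small Python helper that produces SSML with a duration cap. This assumes the cap is expressed through the `amazon:max-duration` attribute on the `<prosody>` tag, as described in Amazon Polly's SSML documentation; the function name `fit_to_duration` is hypothetical.

```python
from xml.sax.saxutils import escape


def fit_to_duration(text: str, max_seconds: float) -> str:
    """Build SSML asking for speech no longer than max_seconds (hypothetical helper)."""
    if max_seconds <= 0:
        raise ValueError("max_seconds must be positive")
    # Express the limit in whole milliseconds to avoid fractional values.
    millis = int(round(max_seconds * 1000))
    return (
        f'<speak><prosody amazon:max-duration="{millis}ms">'
        f"{escape(text)}</prosody></speak>"
    )


# German line that must fit the 2.5 second slot of the English original.
print(fit_to_duration("Guten Tag", 2.5))
```

The service may limit how far it will speed speech up, so very tight limits might not be met exactly.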

Audio Stream Formats

Amazon Polly supports various audio stream formats, including MP3, Vorbis, and raw PCM. You can select from different sampling rates that optimize bandwidth and audio quality for your application, ensuring a high-quality user experience.
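One way to guard against an unsupported combination before sending a request is a small lookup table. The sketch below is in Python; the format identifiers and sample rates listed are my understanding of the values the API accepts and should be checked against the current API reference, and the helper name `check_format` is hypothetical.

```python
# Assumed valid sample rates (Hz, as strings) per output format.
SAMPLE_RATES = {
    "mp3": ("8000", "16000", "22050", "24000"),
    "ogg_vorbis": ("8000", "16000", "22050", "24000"),
    "pcm": ("8000", "16000"),
}


def check_format(output_format: str, sample_rate: str) -> bool:
    """Return True if the format/sample-rate pair appears in the table above."""
    return sample_rate in SAMPLE_RATES.get(output_format, ())


print(check_format("pcm", "16000"))
print(check_format("pcm", "22050"))
```

Lower sample rates reduce bandwidth, which can matter for telephony, while higher rates favor audio quality.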

Integration

API Integration

Amazon Polly provides an API that enables users to quickly integrate speech synthesis into their application. You simply send the text you want to convert into speech to the Amazon Polly API, and the tool immediately returns the audio stream to your application so that it can begin streaming it directly or store it in a standard audio file format, such as MP3.

Additionally, it supports all the programming languages included in the AWS SDK, including Java, .NET, PHP, Python, Ruby, Go, and C++, as well as AWS Mobile SDK (iOS/Android), an HTTP API, and AWS Management Console, which provides users with full control over all the capabilities of Amazon Polly.
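The request-and-stream flow described above might look like the following in Python. This is a minimal sketch, not a definitive integration: it assumes a client object created with the AWS SDK for Python (`boto3.client("polly")`) whose `synthesize_speech` call returns a response containing an `AudioStream`. The helper names `build_request` and `synthesize_to_file` are hypothetical.

```python
from typing import Any, Dict


def build_request(text: str, voice_id: str = "Joanna",
                  engine: str = "neural", output_format: str = "mp3") -> Dict[str, str]:
    """Assemble keyword arguments for a synthesize_speech call."""
    return {
        "Text": text,
        "VoiceId": voice_id,
        "Engine": engine,
        "OutputFormat": output_format,
    }


def synthesize_to_file(polly_client: Any, text: str, path: str) -> int:
    """Send text to the service and save the returned audio; returns bytes written.

    polly_client is assumed to be boto3.client("polly") or a compatible object.
    """
    response = polly_client.synthesize_speech(**build_request(text))
    audio = response["AudioStream"].read()
    with open(path, "wb") as out:
        out.write(audio)
    return len(audio)


print(build_request("Hello from Amazon Polly"))
```

Calling `synthesize_to_file(boto3.client("polly"), "Hello", "hello.mp3")` requires AWS credentials and a configured region, and each call counts toward billed characters.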

Customization

Amazon Polly offers a high degree of customization that enables users to modify the pronunciation of particular words, such as company names, acronyms, foreign words, and neologisms. You create a custom XML file with lexical entries that modify the pronunciation of words, and Amazon Polly generates the corresponding speech output.

This feature allows users to create truly bespoke speech output for their applications.
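A custom lexicon file of the kind described above could be generated like this. The sketch is in Python and assumes the file follows the W3C Pronunciation Lexicon Specification (PLS), with each entry mapping a written form (`grapheme`) to a spoken replacement (`alias`); the helper name `build_lexicon` is hypothetical.

```python
from typing import Dict
from xml.sax.saxutils import escape


def build_lexicon(aliases: Dict[str, str], lang: str = "en-US") -> str:
    """Render a PLS lexicon mapping written forms to spoken aliases."""
    lexemes = "".join(
        f"<lexeme><grapheme>{escape(written)}</grapheme>"
        f"<alias>{escape(spoken)}</alias></lexeme>"
        for written, spoken in aliases.items()
    )
    return (
        '<?xml version="1.0" encoding="UTF-8"?>'
        '<lexicon version="1.0" '
        'xmlns="http://www.w3.org/2005/01/pronunciation-lexicon" '
        f'alphabet="ipa" xml:lang="{lang}">{lexemes}</lexicon>'
    )


print(build_lexicon({"W3C": "World Wide Web Consortium"}))
```

With the AWS SDK for Python, the resulting string would typically be uploaded once under a name and then referenced by that name in later synthesis requests.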

Brand Voice

Amazon Polly offers a unique engagement option called Brand Voice, which allows organizations to differentiate themselves further by creating a Neural Text-to-Speech (NTTS) voice for their exclusive use. The tool helps throughout the entire process of creating the persona, identifying an actor or actress, recording their speech, and ultimately building and training a model to produce the voice. The unique voice is then made available to your AWS account ID(s).

A Brand Voice can be used for any text-to-speech use case, including integrations with Amazon Connect and Alexa Skills.

Pricing

Amazon Polly offers flexible and affordable pricing options. With Amazon Polly, you only pay for what you use, based on the number of characters of text that you convert either to speech or to Speech Marks metadata. In addition, you can cache and replay Amazon Polly's generated speech at no additional cost.

You can get started with the Amazon Polly Free Tier at no charge, and if you require additional usage, Amazon Polly's Standard voices are priced at $4.00 per 1 million characters for speech or Speech Marks requests. For Neural voices, you will be charged $16.00 per 1 million characters for speech or Speech Marks requests. However, the free tier includes 5 million characters per month for Standard voices and 1 million characters per month for Neural voices, for the first 12 months, starting from your first request for speech.

For government customers, Amazon Polly offers Standard voices in the AWS GovCloud (US) region, which are priced at $4.80 per 1 million characters for speech or Speech Marks requests for the first 12 months, starting from your first request for speech, after which standard pricing applies. Neural TTS voices are priced at $19.20 per 1 million characters for speech or Speech Marks requests.

To give you an idea of pricing at the Neural rate of $16.00 per 1 million characters, $0.04 covers 2,500 characters of synthesized speech or Speech Marks data, and 10,000 characters cost $0.16. At the Standard rate of $4.00 per 1 million characters, the same amounts cost $0.01 and $0.04.
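The per-character arithmetic can be reproduced with a short Python sketch. It uses the Standard and Neural rates quoted in this section and ignores the free tier; the function name `estimate_cost` is hypothetical.

```python
from decimal import Decimal

# US dollars per 1 million characters, as quoted in this section.
RATES = {"standard": Decimal("4.00"), "neural": Decimal("16.00")}


def estimate_cost(characters: int, voice_type: str = "standard") -> Decimal:
    """Estimate the charge in dollars, rounded to cents (free tier not applied)."""
    cost = RATES[voice_type] * characters / Decimal(1_000_000)
    return cost.quantize(Decimal("0.01"))


print(estimate_cost(2_500, "neural"))     # 0.04
print(estimate_cost(10_000, "neural"))    # 0.16
print(estimate_cost(10_000, "standard"))  # 0.04
```

Decimal arithmetic is used instead of floats so that small amounts round to cents predictably.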

You can easily calculate your monthly costs with AWS, and if you require a personalized quote or assistance with getting started, you can contact AWS specialists for help.

FAQ

What is Amazon Polly?

Amazon Polly is an AI service that converts text into lifelike speech, enabling applications to speak like a human. The service offers dozens of lifelike voices in various languages, and you can select your preferred voice and distribute your speech-enabled applications globally.

Amazon Polly immediately returns an audio stream to your application that can be played directly or stored in MP3 format. With the support of Speech Synthesis Markup Language (SSML) tags, you can adjust the speech rate, pitch, volume, and more to make it sound more natural. Amazon Polly is a secure, easy-to-use, and cost-effective solution that allows you to convert millions of characters per month for free during the first year upon sign-up.

Why should I use Amazon Polly?

You should use Amazon Polly to create high-quality spoken output for your custom application. The service is cost-effective, offers fast response times, and places no restrictions on storing and reusing generated speech. You can use Amazon Polly to add speech capabilities to various solutions, such as e-learning, smart devices, telephony, smart transportation systems, industrial control systems, and many more.

What features are available?

Amazon Polly offers various features, including standardized SSML tags that let you adjust parameters such as pronunciation, volume, pitch, and rate. A Newscaster speaking style is available for Neural voices, making the speech sound like a TV or radio newscaster.

Additionally, Amazon Polly allows you to synchronize graphical highlighting and animations with synthesized speech by detecting when specific words or sentences are spoken based on their metadata. Custom lexicons can also be used to modify the pronunciation of specific words, acronyms, company names, foreign words, neologisms, and more.

What are Speech Marks?

Speech Marks are a metadata feature that enables you to complement synthesized speech with visual experiences such as speech-synchronized animation or karaoke-style highlighting. Amazon Polly generates Speech Marks in the form of a JSON stream, which contains one to four different elements: word timing, sentence timing, viseme timing, and phoneme timing. By using Speech Marks with the synthesized speech audio stream, you can create more intuitive and user-friendly experiences.

What are the most common use cases for this service?

Amazon Polly can be used in various fields, such as e-learning and education, multimedia, industrial control systems, transportation systems, telephony, smart devices, and many more. Some of the most common use cases include helping people with reading disabilities consume digital content, providing self-service cloud-based contact center services, generating speech for announcements, creating voice interfaces for applications, and generating narration for quiz games, animations, and avatars.

Alternatives

If you're looking for an AI text-to-speech tool similar to Amazon Polly, here are some alternatives to consider:

Google Text-to-Speech

Google Text-to-Speech is an AI-powered tool that converts written text into spoken words. It's available for free on Android devices and can be customized to use different languages, accents, and voice options. With natural-sounding voice quality and high accuracy, Google Text-to-Speech is a great choice for anyone in need of a reliable text-to-speech conversion tool.

Microsoft Azure Text-to-Speech

Microsoft Azure Text-to-Speech is another AI-powered text-to-speech tool that provides realistic, high-quality voice synthesis. It offers a variety of customization options, including controlling the voice volume, pitch, and rate of speech. Microsoft Azure Text-to-Speech is available in over 100 languages and dialects and can be used to create natural-sounding voiceovers for a wide range of applications.

IBM Watson Text-to-Speech

IBM Watson Text-to-Speech is a comprehensive AI-powered text-to-speech solution that provides natural-sounding voice synthesis in a variety of languages and accents. With customizable voices and high-quality audio output, IBM Watson Text-to-Speech is ideal for creating voiceovers for videos, audiobooks, and other media. It also includes advanced features such as real-time translation between languages, making it a versatile tool for a wide range of applications.

NaturalReader

NaturalReader is a text-to-speech software that reads aloud any text in a natural-sounding voice. It supports over 60 languages and accents and allows users to adjust the speed, pitch, and volume of the voice. NaturalReader also offers a user-friendly interface and integration with other software, making it a popular choice for professionals and individuals alike.

ReadSpeaker

ReadSpeaker is another AI-powered text-to-speech solution that provides lifelike voice synthesis. It offers a web-based platform that can be integrated with various applications and services, including websites, mobile apps, and call centers. ReadSpeaker supports over 25 languages and voices and can even turn written text into an audio file for offline use.

Published by

Devin Schumacher
