Microsoft has recently announced that it will limit access to its neural text-to-speech AI, Custom Neural Voice. This AI-powered service allows developers to create custom synthetic voices.
Custom Neural Voice is a Text-to-Speech (TTS) feature in Azure Cognitive Services. It allows users to create a customized synthetic voice suited to their brand. Since its preview in September of last year, the feature has helped several customers, such as AT&T, Progressive, Duolingo, and Swisscom, develop speech solutions.
The feature is generally available (GA); however, customers have yet to be given access to Custom Neural Voice, including its technical controls.
Microsoft’s technology for Custom Neural Voice consists of three major components: Neural Acoustic Model, Text Analyzer, and Neural Vocoder.
- Neural Acoustic Model predicts acoustic features that define speech signals, such as timbre, speaking style, speed, intonations, and stress patterns.
- Text Analyzer converts the input text into a phoneme sequence, the first step in generating natural synthetic speech from text.
- Neural Vocoder converts the acoustic features into audible waves to generate synthetic speech.
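The way the three components hand data to one another can be sketched in a few lines of Python. This is only a toy illustration, not Microsoft's implementation: it assumes the stages run in the order text analysis, acoustic modeling, vocoding, and every name in it (`LEXICON`, `AcousticFrame`, `text_analyzer`, `acoustic_model`, `vocoder`, `synthesize`) is hypothetical. The "acoustic model" here is a fixed formula standing in for a trained neural network, and the "vocoder" simply emits sine waves.

```python
import math
from dataclasses import dataclass

SAMPLE_RATE = 16_000  # samples per second; an assumed value for this toy example

# Hypothetical mini-lexicon. A real text analyzer uses far richer linguistic rules.
LEXICON = {
    "hello": ["HH", "AH", "L", "OW"],
    "world": ["W", "ER", "L", "D"],
}


@dataclass
class AcousticFrame:
    """Acoustic features for one phoneme (a tiny subset of what a real model predicts)."""
    pitch_hz: float
    duration_s: float


def text_analyzer(text: str) -> list[str]:
    """Stage 1: turn raw text into a phoneme sequence."""
    phonemes: list[str] = []
    for word in text.lower().split():
        word = word.strip(".,!?")
        # Words missing from the lexicon fall back to being spelled out letter by letter.
        phonemes.extend(LEXICON.get(word, list(word.upper())))
    return phonemes


def acoustic_model(phonemes: list[str]) -> list[AcousticFrame]:
    """Stage 2: predict acoustic features for each phoneme.

    A deterministic placeholder formula stands in for the neural network.
    """
    frames = []
    for index, phoneme in enumerate(phonemes):
        # Pitch depends on the phoneme and drifts slightly downward over the utterance.
        pitch = 120.0 + (sum(map(ord, phoneme)) % 80) - index
        frames.append(AcousticFrame(pitch_hz=pitch, duration_s=0.08))
    return frames


def vocoder(frames: list[AcousticFrame]) -> list[float]:
    """Stage 3: convert acoustic features into waveform samples in [-1, 1]."""
    samples: list[float] = []
    for frame in frames:
        sample_count = round(frame.duration_s * SAMPLE_RATE)
        for k in range(sample_count):
            samples.append(math.sin(2 * math.pi * frame.pitch_hz * k / SAMPLE_RATE))
    return samples


def synthesize(text: str) -> list[float]:
    """Run the full pipeline: text -> phonemes -> acoustic features -> audio samples."""
    return vocoder(acoustic_model(text_analyzer(text)))


if __name__ == "__main__":
    audio = synthesize("Hello, world!")
    print(f"{len(audio)} samples, {len(audio) / SAMPLE_RATE:.2f} seconds of audio")
```

Keeping each stage as a separate function mirrors the component split described above: in this sketch, the text analyzer or the vocoder could be swapped out without touching the other two stages.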
Neural TTS models are trained using deep neural networks on samples of real voice recordings. Customers can even adapt the Neural TTS engine with Custom Neural Voice’s customization capability to better fit their user scenarios.
Customers can apply Custom Neural Voice to a range of use cases, such as customer service chatbots, online learning assistants, public service announcements, audiobooks, and real-time translations.
In an Azure AI blog post, Qinying Liao, principal program manager at Microsoft, says, “Empowered with this technology, Custom Neural Voice enables users to build highly-realistic voices with just a small number of training audios.”
“This new technology allows companies to spend a tenth of the effort traditionally needed to prepare training data while at the same time significantly increasing the naturalness of the synthetic speech output when compared to traditional training methods.”
Holger Mueller, principal analyst and vice president at Constellation Research Inc., says, “To make computers more human, speech is a crucial ingredient, and in 2020 enterprises need to depart from the robotic and standardized voices, accents of synthetic speech in the past.”
“The cloud enables this level of personalized creation of personalized voice experience – with availability, cheap computing, and operational capacity. So, it is a widespread use case across the IaaS / PaaS players – and suitable for enterprises and their customers, and even employees as they get a more human experience.”
Lastly, in addition to the capability to customize TTS voice models, Microsoft offers more than 200 neural and standard voices covering 54 languages and locales.