Services – phAIstos

TEXT-TO-SPEECH (TTS)

For an overview of what TTS is and how it works, check out my blog post here. I plan to add more posts over time on different aspects of the TTS process.

For the time being, I plan to offer services to businesses that build TTS solutions: solving research problems and delivering custom data sets or TTS models.

I deliver (among others) the following TTS services:
– Linguistic preprocessing
– Data collection / annotation
– Evaluation

Linguistic preprocessing

TTS relies on Natural Language Processing (NLP) to translate input text into the input features used to train an acoustic model.
My services include:
– Text normalization
– Homograph disambiguation
– Multilingual and cross-lingual NLP
– Custom pronunciation dictionaries
– Grapheme-to-phoneme (G2P) models
– Language model (LM) fine-tuning for various tasks
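
To give a feel for the first item in the list above, here is a minimal sketch of text normalization: expanding abbreviations and small numbers into the words a voice should actually say. The abbreviation table, the function names and the 0-99 number range are illustrative assumptions, not a description of a production system, which would also need to handle context, dates, currencies and language-specific rules.

```python
import re

# Hypothetical, deliberately tiny abbreviation table (an assumption for illustration).
ABBREVIATIONS = {"dr.": "doctor", "st.": "street", "km": "kilometers"}

ONES = ["zero", "one", "two", "three", "four", "five", "six", "seven", "eight",
        "nine", "ten", "eleven", "twelve", "thirteen", "fourteen", "fifteen",
        "sixteen", "seventeen", "eighteen", "nineteen"]
TENS = ["", "", "twenty", "thirty", "forty", "fifty", "sixty", "seventy",
        "eighty", "ninety"]


def number_to_words(n: int) -> str:
    """Spell out an integer in the range 0-99 in English."""
    if n < 20:
        return ONES[n]
    tens, ones = divmod(n, 10)
    return TENS[tens] if ones == 0 else f"{TENS[tens]}-{ONES[ones]}"


def normalize(text: str) -> str:
    """Replace known abbreviations and one- or two-digit numbers with words."""
    out = []
    for token in text.split():
        key = token.lower()
        if key in ABBREVIATIONS:
            out.append(ABBREVIATIONS[key])
        elif re.fullmatch(r"[0-9]{1,2}", token):
            out.append(number_to_words(int(token)))
        else:
            # Anything the sketch does not recognize is passed through unchanged.
            out.append(token)
    return " ".join(out)


print(normalize("Dr. Smith walked 42 km"))
```

Even this toy version shows why normalization is hard: "St." can mean "street" or "saint", and only context can decide between them.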

Data collection / annotation

Like most AI processes, TTS relies on data to train neural models, in particular text and speech data. High-quality TTS needs high-quality data. Ethics are very important to me: data is collected, and used for particular applications, only with the full consent and adequate compensation of the parties involved.

In the past I have worked on collecting speech databases for many voices in many languages. Designing a speech database for TTS involves many choices, including balancing the data across sentence types (questions vs. statements), speaking styles and emotions (conversational vs. storytelling), intonational variation, cross-lingual phenomena, and more.
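
As a rough illustration of the balancing idea, the sketch below reports the share of each sentence type in a recording script. Classifying by final punctuation is a simplifying assumption, and the function names are hypothetical; a real design would use richer labels (speaking style, emotion, language) assigned by annotators or models.

```python
from collections import Counter


def sentence_type(sentence: str) -> str:
    """Crude sentence-type label based only on the final punctuation mark."""
    stripped = sentence.strip()
    if stripped.endswith("?"):
        return "question"
    if stripped.endswith("!"):
        return "exclamation"
    return "statement"


def type_distribution(sentences: list[str]) -> dict[str, float]:
    """Return the fraction of sentences that fall into each type."""
    counts = Counter(sentence_type(s) for s in sentences)
    total = sum(counts.values())
    return {label: count / total for label, count in counts.items()}


script = ["Is it raining?", "It is raining.", "It rains a lot.", "Stop!"]
print(type_distribution(script))
```

A skewed distribution here is an early warning that the resulting voice may handle the underrepresented sentence types poorly.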

Evaluation

There are many ways to evaluate the output of a TTS system. Nowadays, it is common to transcribe the generated speech with automatic speech recognition (ASR) and compute the Word Error Rate (WER) against the input text, and to use automatic Mean Opinion Score (MOS) prediction to quantify the naturalness of the speech.
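
For reference, WER is the number of word substitutions, deletions and insertions needed to turn the ASR transcript into the reference text, divided by the number of words in the reference. The sketch below computes it with a standard word-level edit distance. It assumes the ASR transcript is already available as a string, and the lowercasing and whitespace tokenization are simplifications; the function name is my own choice for illustration.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / number of reference words."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion: reference word missing from hypothesis
                curr[j - 1] + 1,     # insertion: extra word in hypothesis
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr
    return prev[-1] / len(ref)


# The text sent to the TTS system versus what the ASR system heard.
print(word_error_rate("the cat sat on the mat", "the cat sat on mat"))
```

Note that this measures the TTS system and the ASR system together: a recognition error counts against the score even when the synthesized speech was perfectly clear.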

However, there is no standard evaluation framework yet, and there are many considerations regarding ‘naturalness’. Is the generated speech appropriate for the intended application? What kinds of errors influence our perception of naturalness? What specific areas of improvement can be identified?