The number of companies producing AI voices (also called Text-to-Speech, TTS, or speech synthesis voices) has grown astronomically over the last few years. Thanks to state-of-the-art machine learning techniques, LLMs, and heaps of training data, they can produce voices that are almost indistinguishable from human speech.
In the past, the subjective quality of TTS voices was typically evaluated with intelligibility and naturalness ratings, collected using either a Mean Opinion Score (MOS) or a Multiple Stimuli with Hidden Reference and Anchor (MUSHRA) paradigm. More recently, MOS values have been automatically predicted by neural models trained on many different MOS tests. The problem is that different MOS tests are difficult to compare: they target different tasks, ask different questions, and use different TTS systems and listener groups. Moreover, MOS ratings have reached a level of saturation where listeners can no longer distinguish human-produced speech from AI-produced speech. As Wagner et al. (2019) argue, there is still a need for subjective evaluation, but it is time to evolve beyond intelligibility and naturalness.
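To make the MOS paradigm concrete, here is a minimal Python sketch (my own illustration, not code from any of the cited papers) of how a MOS is typically computed: listeners rate stimuli on a 1-to-5 scale, and the scores are averaged, usually with a confidence interval to indicate how reliable the mean is. The listener ratings below are hypothetical.

```python
import numpy as np
from scipy import stats

def mos_with_ci(ratings, confidence=0.95):
    """Mean Opinion Score with a t-distribution confidence interval.

    ratings: listener scores on the standard 1-5 opinion scale.
    Returns (mean, half-width of the confidence interval).
    """
    ratings = np.asarray(ratings, dtype=float)
    mean = ratings.mean()
    # Standard error of the mean; t-interval suits small listener panels
    sem = stats.sem(ratings)
    half_width = sem * stats.t.ppf((1 + confidence) / 2, len(ratings) - 1)
    return mean, half_width

# Hypothetical ratings for one TTS system from ten listeners
scores = [4, 5, 4, 4, 3, 5, 4, 4, 5, 4]
mos, ci = mos_with_ci(scores)
print(f"MOS = {mos:.2f} ± {ci:.2f}")
```

Overlapping confidence intervals between systems are exactly where the saturation problem bites: once most systems cluster near the top of the scale, the test can no longer separate them.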
The Blizzard Challenge has benchmarked progress in TTS since 2005. Researchers participating in the challenge build TTS systems from common data sets, and the systems' output is then evaluated. In the past, MOS tests were the only way to compare the systems' naturalness. In 2023, at the Blizzard Challenge in Grenoble, France, the organizers incorporated state-of-the-art speech synthesis evaluation techniques to refine the evaluation of speech quality, speaker similarity, and intelligibility. Perrotin et al. (2025) wrote a more elaborate journal paper about it in Computer Speech & Language, which I highly recommend reading.
In January 2025, I had the pleasure of attending a Dagstuhl seminar at Schloss Dagstuhl in Germany, organized by Petra Wagner, Jens Edlund, Christina Tånnander, and Sébastien Le Maguer, titled “Task and Situation-Aware Evaluation of Speech and Speech Synthesis”. For two and a half days, a group of 25 of us brainstormed about speech synthesis evaluation and how different use cases, such as conversational agents, robots, audiobooks, education, and clinical research, impact evaluation experiments. It was a very valuable experience. We have already published a short paper on arXiv (Cooper et al. 2025) for reviewers of speech synthesis papers to use as a reference when reading papers containing subjective evaluations, and we hope to expand it into a useful guide for authors as well. We are planning to release one or more publications in the future to disseminate our findings and, hopefully, help shape the discussion.
Furthermore, I hope I will have the opportunity to work with companies producing AI voices, to help design and conduct subjective evaluation experiments.

Dagstuhl participants from top left to bottom right: David Traum, Simon King, Jens Edlund, Olivier Perrotin, Bernd Möbius, Roger Moore, Erica Cooper, Elisabeth André, me (Esther Klabbers), Christina Tånnander, Sophia Strömbergsson, Zofia Malisz, Sebastien Le Maguer, Benjamin Cowan, Junichi Yamagishi, Naomi Harte, Petra Wagner, Fritz Seebauer, Gérard Bailly, Ayushi Pandey, Yusuke Yasuda, (not pictured Sebastian Möller).
References
Wagner, Petra, Jonas Beskow, Simon Betz, Jens Edlund, Joakim Gustafson, Gustav Eje Henter, Sébastien Le Maguer, et al. “Speech Synthesis Evaluation—State-of-the-Art Assessment and Suggestion for a Novel Research Program.” 10th ISCA Speech Synthesis Workshop (SSW10), Vienna, Austria, 2019. https://www.isca-archive.org/ssw_2019/wagner19_ssw.pdf
Perrotin, Olivier, Brooke Stephenson, Silvain Gerber, Gérard Bailly, and Simon King. “Refining the Evaluation of Speech Synthesis: A Summary of the Blizzard Challenge 2023.” Computer Speech & Language 90 (2025). https://www.sciencedirect.com/science/article/pii/S088523082400130X
Cooper, Erica, Sébastien Le Maguer, Esther Klabbers, and Junichi Yamagishi. “Good Practices for Evaluation of Synthesized Speech.” arXiv preprint arXiv:2503.03250 (2025). https://arxiv.org/pdf/2503.03250