We are currently at the stage where AI text-to-speech (TTS) can generate voices that sound as natural as human speech. There are many use cases out there and many AI companies offering these voices in a multitude of languages, dialects, speaking styles, and personas. But that doesn’t mean that the generated speech is always correct, and mistakes that are made have a real impact on how the speech is perceived and understood.

In this blog post, I would like to highlight one of the problems that often comes to light when testing different TTS voices: the disambiguation of homographs, sometimes also called heteronyms. Technically, homographs are words that have the same spelling but a different meaning, and heteronyms are the subset of homographs where the different meanings also have a different pronunciation, but the two terms are often used interchangeably.

Homographs are words such as ‘live’ and ‘read’, but also abbreviations that need expanding when spoken, such as ‘St’ and ‘Dr’. In this post, I am focusing on English, but other languages also have homographs. When processing text to generate speech, it is important to normalize the text and determine how each word needs to be pronounced; disambiguating homographs is part of that process.

Some homographs differ because they play a different role in the sentence, called their part of speech: they can be nouns, verbs, or adjectives. Others simply have different meanings.

Word        Reading 1         Pronunciation          Reading 2         Pronunciation
object      noun              /’ah b j eh k t/       verb              /ah b ’j eh k t/
advocate    noun              /’ae d v ou k ah t/    verb              /’ae d v ou k ei t/
wind        noun              /’w ih n d/            verb              /’w ai n d/
live        verb              /’l ih v/              adjective         /’l ai v/
read        verb (present)    /’r iy d/              verb (past)       /’r eh d/
bass        noun (fish)       /’b ae s/              noun (music)      /’b ei s/
nice        adjective         /’n ai s/              name              /’n iy s/
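
To make the idea concrete, here is a minimal sketch of how such entries could be stored once the role or meaning of the word is known. The dictionary name `HOMOGRAPHS`, the function `pronounce`, and the label strings are assumptions made for illustration, and the sketch does not show the hard part, which is predicting the label from context.

```python
# Hypothetical lexicon keyed by (word, label). The pronunciations are copied
# from the entries listed above; the label is either a part of speech or a
# sense, whichever separates the two readings.
HOMOGRAPHS = {
    ("object", "noun"): "'ah b j eh k t",
    ("object", "verb"): "ah b 'j eh k t",
    ("wind", "noun"): "'w ih n d",
    ("wind", "verb"): "'w ai n d",
    ("live", "verb"): "'l ih v",
    ("live", "adjective"): "'l ai v",
    ("bass", "fish"): "'b ae s",
    ("bass", "music"): "'b ei s",
}


def pronounce(word, label):
    """Return the pronunciation for a (word, label) pair, or None if unknown."""
    return HOMOGRAPHS.get((word.lower(), label))
```

For example, `pronounce("live", "verb")` returns the short-vowel reading, while `pronounce("live", "adjective")` returns the long-vowel one. Note that ‘bass’ cannot be separated by part of speech at all, since both readings are nouns, so a sense label is needed instead.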

Abbreviation expansion

When using TTS for GPS navigation in the US, it is very common to encounter street names that contain directional information such as NW or SW. It is surprising that many common map apps are still unable to expand these to ‘north-west’ or ‘south-west’ and spell them out letter by letter instead. This can be very confusing for listeners, especially since ‘en’ and ‘es’ sound very similar, whereas ‘north’ and ‘south’ do not.
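
As a rough illustration, expanding the two-letter directionals can be done with a small lookup and a regular expression. The names `DIRECTIONS` and `expand_directions` are made up for this sketch, and it deliberately leaves single letters such as N or S alone, because those are far more ambiguous in street names.

```python
import re

# Two-letter compass abbreviations only; single letters (N, S, E, W) are
# skipped on purpose because they clash with initials and lettered streets.
DIRECTIONS = {
    "NE": "north-east",
    "NW": "north-west",
    "SE": "south-east",
    "SW": "south-west",
}

# Match the abbreviation only when it stands alone as an uppercase token.
_DIRECTION_RE = re.compile(r"\b(NE|NW|SE|SW)\b")


def expand_directions(street_name):
    """Replace standalone NE/NW/SE/SW tokens with their spoken form."""
    return _DIRECTION_RE.sub(lambda m: DIRECTIONS[m.group(1)], street_name)
```

With this sketch, `expand_directions("1200 NW 23rd Ave")` gives `"1200 north-west 23rd Ave"`. A real normalizer would need more context than this, for example to avoid expanding a two-letter token that is part of a company name.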

Ambiguous abbreviations

There are also abbreviations that can be expanded in more than one way, such as ‘St.’ and ‘Dr.’. An additional problem with these abbreviations is that they are usually followed by a period, and the period is also used to end a sentence. ‘St.’ can be expanded to ‘Saint’ or to ‘Street’, but what happens when a system has to pronounce the sentence “I live on St. Charles’ St.”?

In a small test of several online TTS demos, I found that most of them said ‘Saint Charles Saint’. One of them even treated the first period as the end of the sentence, and another failed to pronounce ‘live’ correctly, even though it is obviously the verb and not the adjective.
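
A simple positional rule already handles the example sentence: treat ‘St.’ as ‘Saint’ when a capitalized word follows it, and as ‘Street’ otherwise. The sketch below is a hand-written heuristic under that assumption, with a hypothetical function name `expand_st`; it is not how any of the tested systems work.

```python
import re


def expand_st(text):
    """Expand 'St'/'St.' using a capitalization heuristic (illustrative only)."""
    # 'St.' directly before a capitalized word is read as 'Saint'.
    text = re.sub(r"\bSt\b\.?(?=\s+[A-Z])", "Saint", text)
    # 'St.' at the very end keeps its period, which also ends the sentence.
    text = re.sub(r"\bSt\b\.(?=\s*$)", "Street.", text)
    # Any remaining 'St' or 'St.' is read as 'Street' and loses its period.
    text = re.sub(r"\bSt\b\.?", "Street", text)
    return text
```

Applied to “I live on St. Charles’ St.”, this returns “I live on Saint Charles’ Street.”. The rule is easy to break, though: in “I live on Main St. Charles lives next door.” the period really does end a sentence, and the heuristic wrongly produces ‘Saint Charles’. This is the kind of case where a trained disambiguation model is needed.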

Evaluating homograph disambiguation in TTS systems

Whenever we do any kind of evaluation, it is important to use a set of test sentences that has not been seen during training. For homograph disambiguation, this set should contain all pronunciation variants of each homograph, even if their distribution in the language is very skewed. Of course, all these variants must also be covered in the training material; otherwise the trained disambiguation model will never be able to predict the less frequent variant. For this type of evaluation, we typically report how accurately the system predicts the correct pronunciation.
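
The effect of a skewed distribution is easy to see when accuracy is computed per variant as well as overall. The following is a minimal sketch; the function names and the `(word, gold, predicted)` tuple layout are assumptions chosen for illustration.

```python
from collections import defaultdict


def overall_accuracy(examples):
    """examples: list of (word, gold_label, predicted_label) tuples."""
    return sum(gold == pred for _, gold, pred in examples) / len(examples)


def accuracy_by_variant(examples):
    """Accuracy for each (word, gold_label) pair, so rare variants stay visible."""
    totals = defaultdict(int)
    correct = defaultdict(int)
    for word, gold, pred in examples:
        totals[(word, gold)] += 1
        if pred == gold:
            correct[(word, gold)] += 1
    return {key: correct[key] / totals[key] for key in totals}
```

Take a toy test set with three ‘bass’ (music) sentences and one ‘bass’ (fish) sentence, and a system that always predicts the music reading. Its overall accuracy is 0.75, but the per-variant breakdown shows 1.0 for the music reading and 0.0 for the fish reading, which is exactly the failure that a single overall number hides.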

Gorman et al. (2018) collected English text from Wikipedia for 163 homographs, labeled the examples, and trained a machine learning model to disambiguate them. They made the data set public, and many TTS systems have used it to train a homograph disambiguation model for English. There are at least 2 instances of every homograph in the training set.

Gorman, K., Mazovetskiy, G., and Nikolaev, V. (2018). Improving homograph disambiguation with machine learning. In Proceedings of the Eleventh International Conference on Language Resources and Evaluation, pages 1349-1352. Miyazaki, Japan. (data set)
