I would like to programmatically check whether a string can be pronounced or needs to be spelled out.
For example, internationalization can be read out, but i18n cannot, nor can hhdirgxzf.
I can think of some simple heuristics such as checking whether the string contains non-alpha characters, but I hope there is a more robust and scientific way to do it. Are there algorithmic approaches that can score a string based on how easy it is to pronounce?
Related: "Is there a way to rank the difficulty of pronunciation of a word?" However, I don't have a list of words and I can't precompute anything.
Update based on comments.
- As I'm an English speaker I'm interested in English but I could imagine an algorithm that was based on the way sound and speaking works rather than the characteristics of a particular language.
- By "pronounced" I mean the string can be read out naturally. It's possible to pronounce hhdirgxzf, but it would not sound like one natural-language word; it would need to be broken up.
- A specific use case I have in mind is where I am sent strings and I want to use a basic text-to-speech system to read them out loud. I want to determine which tokens in the string to let the TTS system try to pronounce and which to make it spell out, erring on the side of spelling out if not confident.
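To make the use case concrete, here is a rough sketch of the kind of naive heuristic I mean. The function names (`should_pronounce`, `plan_speech`) are made up for illustration. The vowel check and the consonant-run limit of 4 are my own assumptions rather than anything principled. The only rule I am sure about is that non-alphabetic tokens get spelled out.

```python
# Naive baseline, not a real pronounceability model.
# Assumption: ASCII English text, with 'y' counted as a vowel.
VOWELS = set("aeiouy")


def longest_consonant_run(token: str) -> int:
    """Return the length of the longest run of consecutive non-vowel letters."""
    longest = current = 0
    for ch in token.lower():
        if ch in VOWELS:
            current = 0
        else:
            current += 1
            longest = max(longest, current)
    return longest


def should_pronounce(token: str, max_consonant_run: int = 4) -> bool:
    """Return True only if the token passes every check; otherwise spell it out."""
    # Digits, punctuation, empty strings and non-ASCII letters are spelled out.
    if not (token.isascii() and token.isalpha()):
        return False
    # A token with no vowel at all is unlikely to read as a word.
    if not any(ch in VOWELS for ch in token.lower()):
        return False
    # Arbitrary threshold (assumption): long consonant clusters get spelled out.
    return longest_consonant_run(token) <= max_consonant_run


def plan_speech(text: str) -> list[tuple[str, str]]:
    """Tag each whitespace-separated token as 'pronounce' or 'spell'."""
    return [
        (tok, "pronounce" if should_pronounce(tok) else "spell")
        for tok in text.split()
    ]
```

This handles my three examples, but only by accident of the threshold. It would spell out a real word like "strengths" (a run of five consonants) and would happily try to pronounce something like "zqaxo", which is why I'm hoping for something more robust.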
i18n, something like eye-ate-een-en. Your other example is a bit more of a challenge but I'll give it a go ... – Herta
i18n -> eye-eighteen-en, and hhdirgxzf -> hud-er-gux-zuf. – Booze
hu-hu-der-gez-zof – Herta