I am working on detecting rhymes in Python using the Carnegie Mellon University (CMU) Pronouncing Dictionary, and would like to know: how can I estimate the phonemic similarity between two words? In other words, is there an algorithm that can identify the fact that "hands" and "plans" are closer to rhyming than "hands" and "fries" are?
Some context: at first, I was willing to say that two words rhyme if their primary stressed syllable and all subsequent syllables are identical (the code below reads the c06d dictionary file, if you want to replicate this in Python):
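def create_cmu_sound_dict():
    """Map each word to its phonemes from the primary-stressed vowel onward."""
    final_sound_dict = {}
    with open('resources/c06d/c06d') as cmu_dict:
        cmu_dict = cmu_dict.read().split("\n")
    for i in cmu_dict:
        i_s = i.split()
        # Skip blank lines and lines with no phonemes after the word.
        if len(i_s) > 1:
            word = i_s[0]
            syllables = i_s[1:]  # the phonemes, e.g. HH AE1 N D Z
            final_sound = ""
            final_sound_switch = 0
            for j in syllables:
                # A "1" marks the primary-stressed vowel; keep it and
                # every phoneme that follows it.
                if "1" in j:
                    final_sound_switch = 1
                    final_sound += j
                elif final_sound_switch == 1:
                    final_sound += j
            final_sound_dict[word.lower()] = final_sound
    return final_sound_dict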
If I then run
cmu_final_sound_dict = create_cmu_sound_dict()
print(cmu_final_sound_dict["hands"])
print(cmu_final_sound_dict["plans"])
I can see that "hands" and "plans" sound very similar. I could work towards an estimate of this similarity on my own, but I thought I should ask: are there established algorithms or packages that assign a numerical value to the degree of sonic (or auditory) similarity between two words? I realize this is a large question, but I would be most grateful for any advice others can offer.
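To make the question concrete, here is a minimal sketch of the kind of naive estimate I could build myself: a Levenshtein (edit) distance over the phonemes from the primary-stressed vowel onward, scaled to a score between 0 and 1. This is only an assumed baseline, not an established phonetic similarity measure. The names `SAMPLE_ENTRIES`, `rhyming_part`, `edit_distance`, and `rhyme_similarity` are hypothetical, and the three entries are hard-coded in CMU dictionary format so the sketch runs without the dictionary file.

```python
# Hypothetical sample entries in CMU dictionary format:
# each value is the word's space-separated ARPAbet phonemes.
SAMPLE_ENTRIES = {
    "hands": "HH AE1 N D Z",
    "plans": "P L AE1 N Z",
    "fries": "F R AY1 Z",
}


def rhyming_part(phonemes):
    """Return the phonemes from the first primary-stressed vowel onward."""
    for index, phoneme in enumerate(phonemes):
        if phoneme.endswith("1"):
            return phonemes[index:]
    return phonemes  # no primary stress marked: fall back to the whole word


def edit_distance(a, b):
    """Levenshtein distance between two phoneme sequences."""
    previous = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        current = [i]
        for j, y in enumerate(b, start=1):
            cost = 0 if x == y else 1
            current.append(min(
                previous[j] + 1,         # delete x
                current[j - 1] + 1,      # insert y
                previous[j - 1] + cost,  # substitute x with y (or match)
            ))
        previous = current
    return previous[-1]


def rhyme_similarity(word_a, word_b, entries=SAMPLE_ENTRIES):
    """Score in [0, 1]; 1.0 means the rhyming parts are identical."""
    a = rhyming_part(entries[word_a].split())
    b = rhyming_part(entries[word_b].split())
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - edit_distance(a, b) / longest


print(rhyme_similarity("hands", "plans"))  # 0.75
print(rhyme_similarity("hands", "fries"))  # 0.25
```

This ranks "hands"/"plans" above "hands"/"fries", but it treats every phoneme substitution as equally costly, so it cannot tell that swapping two similar vowels is a smaller change than swapping a vowel for a consonant. That limitation is exactly why I am asking about more sophisticated approaches.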