Compare short strings in different languages for similar sound - is Soundex the answer?

Asked 26/5, 2011 at 15:18 Answered 24/12, 2012 at 2:6

Solved unicode match soundex similarity phonetics

How could I get a sound-similarity "rating" for a string written in one language against a string written in another language? That is, an algorithm that will identify that

"David Letterman" and "דוד לטרמן" are strings that sound alike.

(By the way, the above is Hebrew for, you guessed it, "David Letterman", and it is pronounced almost the same as in English.)

The only raw material I have is Unicode strings in their respective languages. That is, I do not have phonemes or phonetic transcriptions/translations of the strings.

I have already implemented a tweaked version of Soundex, which works so-so. Is this the way to go?

Ebonee answered 26/5, 2011 at 15:18 Comment(1)

Dan04's solution works like a charm, better than expected. I managed to merge contact lists of named persons (first + last name) with Hebrew/English comparison, catching duplicates, misspelled names, and similarly spelled names within each language and between the two. No stats to give, but it works almost perfectly. – Ebonee 5/6, 2011 at 12:3

Soundex may not be perfect, but it seems like a reasonable approach, at least for your specific example of English/Hebrew matching.

You definitely can't use the rule about preserving the first letter of the name, but I never liked that even for the Latin alphabet (because I'd have to look under both "E" and "Y" for my mother's family name). I recommend just treating the first letter like all the others.

Then it's just a matter of mapping the Hebrew letters to Soundex codes. You don't really need an intermediate English transliteration; just code the Hebrew → Soundex mapping directly.

בוףפ → 1
גזחךכסקש → 2
דטת → 3
ץצ → 32
ל → 4
םמןנ → 5
ר → 6
אהיע → ignored
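
The table above could be coded roughly as follows. This is only a sketch: the names `HEBREW_GROUPS`, `HEBREW_CODES`, and `hebrew_soundex` are made up for illustration, the first letter is treated like all the others (as recommended above), and the key is not padded or truncated to a fixed length. As a simplifying assumption, a repeated digit is collapsed even when an ignored letter sits between the two consonants, which differs from standard Soundex but seems reasonable for a script that rarely writes vowels.

```python
# Hebrew letter groups -> Soundex digits, copied from the table above.
HEBREW_GROUPS = {
    "בוףפ": "1",
    "גזחךכסקש": "2",
    "דטת": "3",
    "ץצ": "32",
    "ל": "4",
    "םמןנ": "5",
    "ר": "6",
}

# Flatten to a per-character lookup. Letters absent from it (אהיע) are ignored.
HEBREW_CODES = {
    ch: code for letters, code in HEBREW_GROUPS.items() for ch in letters
}


def hebrew_soundex(text):
    """Encode Hebrew text as a digit string; unmapped characters are skipped."""
    digits = []
    for ch in text:
        code = HEBREW_CODES.get(ch)
        if code is None:  # ignored letters, spaces, punctuation
            continue
        for d in code:  # "32" contributes two digits
            if not digits or digits[-1] != d:  # collapse repeated digits
                digits.append(d)
    return "".join(digits)


print(hebrew_soundex("שבת"))        # 213
print(hebrew_soundex("דוד לטרמן"))  # 3134365
```

With the usual English Soundex groups (BFPV → 1, CGJKQSXZ → 2, DT → 3, L → 4, MN → 5, R → 6) and the same collapsing rule, "David Letterman" would also come out as 3134365.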

However, because Soundex is English-centric, it may not correctly handle certain ambiguities in the Hebrew pronunciation:

ו is mapped to 1 (like English V) in the list above, but it often represents O, U, or W, in which case it should be ignored in Soundex.
ח is hard to classify due to its lack of an English equivalent. I put it in category 2 because this (1) matches the "ch" transliteration, and (2) allows ך/כ to have the same category with or without a dagesh.
Ashkenazi pronunciation would split ת between categories 2 and 3.

To deal with this, you could generate multiple Soundex keys for a string. E.g., "שבת" would map to both 212 and 213.
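
One possible way to generate the multiple keys is to give each ambiguous letter several candidate codes and enumerate every combination. The sketch below is an illustration under assumptions, not a definitive implementation: `CANDIDATES` and `soundex_keys` are hypothetical names, only ו (code 1 or ignored) and ת (code 3 or 2) are treated as ambiguous, and repeated digits are collapsed after the combination is chosen.

```python
from itertools import product

# Letter groups -> tuple of candidate codes. "" means "ignore this letter".
GROUPS = {
    "בףפ": ("1",),
    "גזחךכסקש": ("2",),
    "דט": ("3",),
    "ץצ": ("32",),
    "ל": ("4",),
    "םמןנ": ("5",),
    "ר": ("6",),
    "ו": ("1", ""),   # V sound, or a vowel/W that Soundex would ignore
    "ת": ("3", "2"),  # T, or S in Ashkenazi pronunciation
}
CANDIDATES = {ch: codes for letters, codes in GROUPS.items() for ch in letters}


def soundex_keys(text):
    """Return the set of all digit keys the text could map to."""
    options = [CANDIDATES[ch] for ch in text if ch in CANDIDATES]
    keys = set()
    for combo in product(*options):
        digits = []
        for d in "".join(combo):
            if not digits or digits[-1] != d:  # collapse repeated digits
                digits.append(d)
        keys.add("".join(digits))
    return keys


print(sorted(soundex_keys("שבת")))  # ['212', '213']
```

The number of keys can double with every ambiguous letter, so this is probably only practical for short strings such as names. Two strings could then be considered a match if their key sets share any key, or if the best pairwise similarity between keys is high enough.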

Similar mappings can be made for Greek:

ΒΠΦ → 1
Ψ → 12
ΓΖΚΞΣΧ → 2
ΔΘΤ → 3
Λ → 4
ΜΝ → 5
Ρ → 6
ΑΕΗΙΟΥΩ → ignored

or Russian:

БВПФ → 1
ГЖЗКСХЧШЩ → 2
ДТ → 3
Ц → 32
Л → 4
МН → 5
Р → 6
АЕЁИЙОУЪЫЬЭЮЯ → ignored

(Note that some of the 2's might be 32's, depending on your transliteration convention.)

A similarity "rating" can be obtained based on a metric like longest common subsequence length or Levenshtein distance on the Soundex values.

For example, you can define the "similarity" between two strings as 2*lcslen(A, B)/(len(A)+len(B)) to obtain a score between 0 and 1.
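
A minimal sketch of that score, assuming the inputs are already Soundex digit strings. The names `lcs_len` and `similarity` are illustrative, and returning 0.0 when both keys are empty is an assumption made here so that two un-encodable strings do not count as a match.

```python
def lcs_len(a, b):
    """Length of the longest common subsequence of a and b (row-by-row DP)."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def similarity(a, b):
    """2 * lcslen(A, B) / (len(A) + len(B)), a score between 0 and 1."""
    if not a and not b:
        return 0.0  # assumption: two empty keys are not treated as similar
    return 2 * lcs_len(a, b) / (len(a) + len(b))


print(similarity("3134365", "3134365"))  # 1.0
print(round(similarity("213", "212"), 3))  # 0.667
```

A threshold on this score (its value would have to be tuned on real data) can then decide whether two names are treated as the same.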

Antakiya answered 29/5, 2011 at 11:5 Comment(0)

I'd suggest looking into the Daitch-Mokotoff Soundex code (particularly good with Hebrew). Check this, which takes English characters as input, and this, which takes Hebrew characters as input.

Penetrant answered 24/12, 2012 at 2:6 Comment(0)

It is unlikely that Soundex is appropriate in general; it is rather crude and somewhat attuned to English. In particular, the first character of the Soundex string is the first character of the input, so your English/Hebrew example will not translate to the same Soundex code unless you also transliterate the Hebrew characters to English (Latin) first. Both Cyrillic and Chinese have transliterations from the native character set to Latin - but there are variations in how it is done.

Investigate Metaphone; however, it is conceptually similar to Soundex and has similar limitations.

I don't know of a cross-lingual equivalent.

I don't know if the IPA (International Phonetic Alphabet) would help. You'd have to translate the English and the Hebrew to the IPA, and then use some similarity function to associate related sounds.

Templia answered 26/5, 2011 at 16:2 Comment(1)

THX. Indeed, in my Soundex tweak I changed the first letter in the foreign language to its English equivalent. Such a Soundex implementation therefore needs 2 mappings: one from each character in the foreign language to its exact English equivalent, and one from each letter into one of the 6 sets that are used to calculate the Soundex value for a string. – Ebonee 27/5, 2011 at 12:10
