How to make an International Soundex?

The Soundex algorithm is optimized for English. Is there a more universal phonetic algorithm that works across large families of languages?

Hulsey answered 24/9, 2008 at 14:42 Comment(0)

SOUNDEX is indeed English-oriented. Two other algorithms that take a wider variety of phonetic differences into account are Double Metaphone and NYSIIS.

Both encode into a much larger space of possible codes than SOUNDEX does. Double Metaphone, specifically, includes reductions with the express purpose of handling alternate pronunciations drawn from languages other than English.
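For comparison, here is a minimal sketch of classic American Soundex. It assumes the common variant in which H and W do not separate letters with the same code, and the names `CODES` and `soundex` are illustrative, not taken from any library. It is only meant to show how small the code space is (one letter plus three digits) and how letters outside A-Z simply fall away.

```python
# Digit codes for the consonant groups used by American Soundex.
CODES = {}
for digit, group in enumerate(["BFPV", "CGJKQSXZ", "DT", "L", "MN", "R"], start=1):
    for letter in group:
        CODES[letter] = str(digit)


def soundex(name: str) -> str:
    """Return a 4-character Soundex code, or "" if no A-Z letter is present."""
    # Anything outside plain A-Z is discarded, which is part of the English bias.
    letters = [c for c in name.upper() if "A" <= c <= "Z"]
    if not letters:
        return ""
    first = letters[0]
    result = [first]
    prev = CODES.get(first, "")
    for ch in letters[1:]:
        if ch in "HW":
            continue  # H and W do not break a run of equal codes
        code = CODES.get(ch, "")  # vowels and Y map to "" and do break a run
        if code and code != prev:
            result.append(code)
        prev = code
    # Pad with zeros and truncate to one letter plus three digits.
    return ("".join(result) + "000")[:4]


print(soundex("Robert"))   # R163
print(soundex("Rupert"))   # R163
print(soundex("Lukasz"))   # L220
print(soundex("Łukasz"))   # U220, because the non-ASCII first letter is dropped
```

With this sketch, a name like "Łukasz" loses its first letter entirely, which is the kind of behavior that makes plain Soundex a poor fit outside English.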

I recently gave a presentation on fuzzy string matching; the slides may be helpful.

Abysm answered 24/9, 2008 at 15:51 Comment(5)
The link to your slides is broken (404). - Belted
@John: the new link seems to be asymmetrical-view.com/talks/#fuzzy-string-matching - Gnostic
Thanks, I just updated it to point to the PDF in the related GitHub repo. I hope that stays more constant. - Abysm
On slide 38 you show similarity percentages above 50%. I'm not saying it's wrong, but what formula are you using to calculate the similarity percentage from the edit distance? The formula I've seen, 1 / (1 + dist), maxes out at 50% for inexact matches. I know your costs are variable, but 1 / 1.4 != 93%, which is the number on your slide. Thanks! - Veilleux
I may not have the same version you do: for me, slide 38 is an edit distance grid :( Which words are being compared on the slide you're looking at? The similarity formula I usually use is (max(len(a),len(b)) - num_edits) / max(len(a),len(b)). If you're looking at the Text Brew algorithm, it allows different costs for the various edits; I'm pretty sure I used the same formula there, and there is sample code in the GitHub repo. If you can tell me what's on the slide in question I can probably give a better answer, or email me and we'll figure it out. - Abysm
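The two similarity formulas discussed in these comments can be compared with a short sketch. It assumes plain Levenshtein distance with unit costs (not the variable-cost Text Brew variant from the slides), and the function names are illustrative only.

```python
def levenshtein(a: str, b: str) -> int:
    """Unit-cost edit distance (insert, delete, substitute), row by row."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def length_normalized_similarity(a: str, b: str) -> float:
    """(max(len(a), len(b)) - num_edits) / max(len(a), len(b))."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0  # two empty strings are treated as identical
    return (longest - levenshtein(a, b)) / longest


def reciprocal_similarity(a: str, b: str) -> float:
    """1 / (1 + dist); any inexact match scores at most 0.5."""
    return 1 / (1 + levenshtein(a, b))


print(length_normalized_similarity("hello", "hallo"))  # 0.8
print(reciprocal_similarity("hello", "hallo"))         # 0.5
```

A single edit in a five-letter word scores 0.8 under the length-normalized formula but only 0.5 under the reciprocal one, which is why the first can report values above 50% for near matches.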