How to make an International Soundex?
The Soundex algorithm, for example, is optimized for English. Is there a more universal algorithm that would apply across large families of languages?
SOUNDEX is indeed English-oriented. Two alternatives that account for a wider variety of phonetic differences are Double Metaphone and NYSIIS.
Both encode into a much larger space of possible codes than SOUNDEX does. Double Metaphone, specifically, includes reductions expressly designed to handle alternate pronunciations from languages other than English.
I did a presentation on fuzzy string matching recently; the slides may be helpful.
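For reference, classic Soundex fits in a few lines, which makes its English bias easy to see: the digit groups below encode English consonant classes, and the vowel/h/w rules assume English spelling. A minimal sketch (my own illustration, not code from this answer):

```python
# Minimal classic Soundex: first letter kept, remaining consonants mapped
# to digit classes, vowels dropped, padded/truncated to 4 characters.
def soundex(word: str) -> str:
    groups = {"bfpv": "1", "cgjkqsxz": "2", "dt": "3",
              "l": "4", "mn": "5", "r": "6"}

    def code(ch: str) -> str:
        for letters, digit in groups.items():
            if ch in letters:
                return digit
        return ""  # vowels and h/w/y carry no code

    word = word.lower()
    result = word[0].upper()
    prev = code(word[0])
    for ch in word[1:]:
        digit = code(ch)
        if digit and digit != prev:
            result += digit
        if ch not in "hw":  # h/w do not break a run of identical codes
            prev = digit
    return (result + "000")[:4]

print(soundex("Robert"))  # → "R163"
print(soundex("Rupert"))  # → "R163" (collides with "Robert" by design)
```

The hard-coded letter classes are exactly why it transfers poorly to other languages; Double Metaphone and NYSIIS replace them with larger context-sensitive rule sets.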
The link to your slides is broken (404) –
Belted
@John: new link seems to be asymmetrical-view.com/talks/#fuzzy-string-matching –
Gnostic
Thanks, I just updated it to point to the PDF in the related GitHub repo; hopefully that link stays more stable. –
Abysm
On slide 38, you're showing percentage similarities that are above 50%. I'm not saying it's wrong, but what formula are you using to calculate the similarity percentage from the edit distance? The formula I've seen,
1 / (1 + dist)
maxes out at 50% for inexact matches. I know your costs are variable, but 1 / 1.4 != 93%,
which is the number you show in your slide. Thanks! –
Veilleux
I may not have the version you do; for me, slide 38 is an edit-distance grid :( Which words are being compared in the one you're looking at? The distance formula I usually use is (max(len(a), len(b)) - num_edits) / max(len(a), len(b)). If you're looking at the Text Brew algorithm, it allows different costs for the various edits, but I'm pretty sure I used the same formula. There is sample code in the GitHub repo. If you can tell me what's on the slide in question, I can probably give a better answer, or email me and we'll figure it out. –
Abysm
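The formula in that reply can be made concrete with a small sketch: standard Levenshtein distance with unit edit costs (unlike the variable-cost Text Brew variant mentioned), normalized by the longer string's length. This is my own illustration, not the presenter's code:

```python
# Levenshtein edit distance via the classic two-row dynamic program.
def levenshtein(a: str, b: str) -> int:
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                  # deletion
                           cur[j - 1] + 1,               # insertion
                           prev[j - 1] + (ca != cb)))    # substitution
        prev = cur
    return prev[-1]

# Similarity as discussed above:
# (max(len(a), len(b)) - num_edits) / max(len(a), len(b))
def similarity(a: str, b: str) -> float:
    longest = max(len(a), len(b))
    return (longest - levenshtein(a, b)) / longest

print(levenshtein("kitten", "sitting"))            # → 3
print(round(similarity("kitten", "sitting"), 3))   # → 0.571
```

Unlike 1 / (1 + dist), this normalization can exceed 50% for inexact matches, since a single edit in a long string still yields a high similarity.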