Good Python modules for fuzzy string comparison? [closed]
232

I'm looking for a Python module that can do simple fuzzy string comparisons. Specifically, I'd like a percentage of how similar the strings are. I know this is potentially subjective so I was hoping to find a library that can do positional comparisons as well as longest similar string matches, among other things.

Basically, I'm hoping to find something that is simple enough to yield a single percentage while still configurable enough that I can specify what type of comparison(s) to do.

Atrocity answered 25/3, 2009 at 16:25 Comment(2)
While not specific to Python, here is a question about similar string algorithms: https://mcmap.net/q/76149/-similar-string-algorithm/…Incessant
possible duplicate of Text difference algorithmCaddaric
249

difflib can do it.

Example from the docs:

>>> from difflib import get_close_matches
>>> get_close_matches('appel', ['ape', 'apple', 'peach', 'puppy'])
['apple', 'ape']
>>> import keyword
>>> get_close_matches('wheel', keyword.kwlist)
['while']
>>> get_close_matches('apple', keyword.kwlist)
[]
>>> get_close_matches('accept', keyword.kwlist)
['except']

Check it out. It has other functions that can help you build something custom.
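For the question's percentage requirement, SequenceMatcher (also in difflib) gives you a ratio you can scale to a percentage, plus the longest matching block; a quick sketch:

>>> from difflib import SequenceMatcher
>>> s = SequenceMatcher(None, 'abcd', 'bcde')
>>> s.ratio()                        # similarity in [0.0, 1.0]
0.75
>>> round(s.ratio() * 100)           # as a percentage
75
>>> s.find_longest_match(0, 4, 0, 4) # longest matching block
Match(a=1, b=0, size=3)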

Thomasthomasa answered 25/3, 2009 at 16:34 Comment(3)
I've actually used difflib before, but found that I couldn't just ask for a percentage match amount. It's been a while though.Atrocity
@Soviut: e.g. difflib.SequenceMatcher(None, 'foo', 'bar').ratio() returns a value between 0-1 which can be interpreted as match percentage. Right?Electronics
Btw -- difflib is a Python, not C, library (unlike pyLevenshtein, which can be hard to install on Windows); thus it's not too fast; I get about a 5-fold speed increase by running it under PyPy instead of CPython. Intriguingly, while quick_ratio (which just calculates the number of characters that are the same, regardless of order) is about 3 times faster than ratio() on CPython, I see no speedup on PyPy. (In case you're wondering, real_quick_ratio just calculates the difference in lengths of the two strings, so it's unlikely to be useful if you expect the two strings to be similar lengths.)Dunbar
164

Levenshtein Python extension and C library.

https://github.com/ztane/python-Levenshtein/

The Levenshtein Python C extension module contains functions for fast computation of:

- Levenshtein (edit) distance and edit operations
- string similarity
- approximate median strings, and generally string averaging
- string sequence and set similarity

It supports both normal and Unicode strings.

$ pip install python-levenshtein
...
$ python
>>> import Levenshtein
>>> help(Levenshtein.ratio)
ratio(...)
    Compute similarity of two strings.

    ratio(string1, string2)

    The similarity is a number between 0 and 1; it's usually equal to or
    somewhat higher than difflib.SequenceMatcher.ratio(), because it's
    based on real minimal edit distance.

    Examples:
    >>> ratio('Hello world!', 'Holly grail!')
    0.58333333333333337
    >>> ratio('Brian', 'Jesus')
    0.0

>>> help(Levenshtein.distance)
distance(...)
    Compute absolute Levenshtein distance of two strings.

    distance(string1, string2)

    Examples (it's hard to spell Levenshtein correctly):
    >>> distance('Levenshtein', 'Lenvinsten')
    4
    >>> distance('Levenshtein', 'Levensthein')
    2
    >>> distance('Levenshtein', 'Levenshten')
    1
    >>> distance('Levenshtein', 'Levenshtein')
    0
Paraldehyde answered 26/3, 2009 at 7:18 Comment(5)
Just wanted to note, for future readers of this thread that happen to be using NLTK in their project, that nltk.metrics.edit_distance('string1', 'string2') will calculate the Levenshtein distance between string1 and string2. So if you're using NLTK like me you might not need to download a Levenshtein library besides this. CheersJezebel
It is now available via PyPI.Inexact
While NLTK has the edit_distance method, it is pure Python. If you are heavily using it, either python-levenshtein or jellyfish can provide a huge speedup... (In my setup I measured >10 times)Mordvin
A slightly newer version of the package can be found at pypi.python.org/pypi/python-LevenshteinCalutron
The PyPI package now supports Python 3 too (0.11.1)Nanette
70

As nosklo said, use the difflib module from the Python standard library.

The difflib module can return a measure of the sequences' similarity using the ratio() method of a SequenceMatcher() object. The similarity is returned as a float in the range 0.0 to 1.0.

>>> import difflib

>>> difflib.SequenceMatcher(None, 'abcde', 'abcde').ratio()
1.0

>>> difflib.SequenceMatcher(None, 'abcde', 'zbcde').ratio()
0.80000000000000004

>>> difflib.SequenceMatcher(None, 'abcde', 'zyzzy').ratio()
0.0
Eikon answered 10/3, 2010 at 17:3 Comment(2)
Not terribly impressed by SequenceMatcher. It gives the same score to David/Daved that it gives to David/david.Miracidium
You'll get the same problem with Levenshtein distance. If you don't care about the case, you should just call lower() on each argument before comparing them.Cardsharp
39

Jellyfish is a Python module which supports many string comparison metrics, including phonetic matching. Pure Python implementations of Levenshtein edit distance are quite slow compared to Jellyfish's implementation.

Example Usage:

>>> import jellyfish
>>> jellyfish.levenshtein_distance('jellyfish', 'smellyfish')
2
>>> jellyfish.jaro_distance('jellyfish', 'smellyfish')
0.89629629629629637
>>> jellyfish.damerau_levenshtein_distance('jellyfish', 'jellyfihs')
1
>>> jellyfish.metaphone('Jellyfish')
'JLFX'
>>> jellyfish.soundex('Jellyfish')
'J412'
>>> jellyfish.nysiis('Jellyfish')
'JALYF'
>>> jellyfish.match_rating_codex('Jellyfish')
'JLLFSH'
Atomicity answered 3/12, 2011 at 19:20 Comment(4)
This looks like a great library, as it has several string comparison algorithms and not just one: Levenshtein Distance, Damerau-Levenshtein Distance, Jaro Distance, Jaro-Winkler Distance, Match Rating Approach Comparison, Hamming DistanceDoubletalk
I'm lazy, clicking links is hard. Examples in the answer would be great.Bezant
n.b. Jellyfish doesn't deal well with unicode stringsDalia
Is it possible to add matching examples to the jellyfish library? In other words, can we have the library classify some specific pairs of words as a match?Stilu
23

I like nosklo's answer; another method is the Damerau-Levenshtein distance:

"In information theory and computer science, Damerau–Levenshtein distance is a 'distance' (string metric) between two strings, i.e., finite sequence of symbols, given by counting the minimum number of operations needed to transform one string into the other, where an operation is defined as an insertion, deletion, or substitution of a single character, or a transposition of two characters."

An implementation in Python from Wikibooks (note that this is plain recursive Levenshtein, exponential and without the transposition step; a Damerau sketch follows the code):

def lev(a, b):
    if not a: return len(b)
    if not b: return len(a)
    return min(lev(a[1:], b[1:]) + (a[0] != b[0]),
               lev(a[1:], b) + 1,
               lev(a, b[1:]) + 1)
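Since the quoted definition includes transpositions, here is a minimal iterative sketch of the Damerau variant (the "optimal string alignment" form); this is my own sketch, not from Wikibooks:

def damerau_lev(a, b):
    # Dynamic programming table; d[i][j] is the distance
    # between a[:i] and b[:j].
    d = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(len(a) + 1):
        d[i][0] = i
    for j in range(len(b) + 1):
        d[0][j] = j
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution
            if (i > 1 and j > 1 and a[i - 1] == b[j - 2]
                    and a[i - 2] == b[j - 1]):
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)  # transposition
    return d[len(a)][len(b)]

For example, damerau_lev('abcd', 'acbd') returns 1 (one transposition), where plain Levenshtein gives 2.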

More from Wikibooks; this gives you the length of the longest common substring (LCS):

def LCSubstr_len(S, T):
    m, n = len(S), len(T)
    L = [[0] * (n + 1) for i in range(m + 1)]
    lcs = 0
    for i in range(m):
        for j in range(n):
            if S[i] == T[j]:
                L[i + 1][j + 1] = L[i][j] + 1
                lcs = max(lcs, L[i + 1][j + 1])
    return lcs
Sorghum answered 25/3, 2009 at 16:46 Comment(2)
Thanks, I found some information about Levenshtein while doing my initial searching, but the examples were far too vague. Your answer is excellent.Atrocity
I chose this one because it gives me a nice scalar number I can work with and use for thresholds.Atrocity
18

There is also Google's own google-diff-match-patch ("Currently available in Java, JavaScript, C++ and Python").

(Can't comment on it, since I have only used Python's difflib myself)
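A minimal usage sketch, assuming the diff-match-patch package from PyPI and its documented Python API:

from diff_match_patch import diff_match_patch

dmp = diff_match_patch()
diffs = dmp.diff_main('Hello world.', 'Goodbye world.')
dmp.diff_cleanupSemantic(diffs)     # merge trivial edits for readability
print(diffs)                        # list of (op, text) tuples
print(dmp.diff_levenshtein(diffs))  # edit distance implied by the diff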

Iong answered 25/3, 2009 at 17:47 Comment(0)
8

Another alternative would be to use the recently released package FuzzyWuzzy. The various functions supported by the package are also described in this blog post.
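A quick sketch of its core scorers (the values below are the ones its README reports; scores are percentages from 0 to 100):

>>> from fuzzywuzzy import fuzz
>>> fuzz.ratio('this is a test', 'this is a test!')
97
>>> fuzz.partial_ratio('this is a test', 'this is a test!')
100
>>> fuzz.token_sort_ratio('fuzzy wuzzy was a bear', 'wuzzy fuzzy was a bear')
100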

Ruffianism answered 29/8, 2011 at 18:41 Comment(0)
5

I am using double metaphone, which works like a charm.

An example (here dm is the double metaphone function from your library of choice; import it accordingly):

>>> dm(u'aubrey')
('APR', '')
>>> dm(u'richard')
('RXRT', 'RKRT')
>>> dm(u'katherine') == dm(u'catherine')
True

Update: Jellyfish also has it, listed under phonetic encoding.

Mcatee answered 16/12, 2011 at 6:30 Comment(0)
4

I've been using FuzzyWuzzy from SeatGeek with great success.

https://github.com/seatgeek/fuzzywuzzy

Specifically, the token set ratio function, sketched below.
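Token set ratio ignores word order and duplicated words; the score below is the one the package's README reports:

>>> from fuzzywuzzy import fuzz
>>> fuzz.token_set_ratio('fuzzy was a bear', 'fuzzy fuzzy was a bear')
100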

They also did a great write up on the process of fuzzy string matching:

http://seatgeek.com/blog/dev/fuzzywuzzy-fuzzy-string-matching-in-python

Haemostasis answered 14/8, 2013 at 3:7 Comment(0)
3

Here's a Python function for computing the longest common substring of two words (it may need tweaking to work for multi-word phrases):

def lcs(word1, word2):
    # All substrings of each word.
    w1 = set(word1[i:j] for i in range(len(word1))
             for j in range(i + 1, len(word1) + 1))
    w2 = set(word2[i:j] for i in range(len(word2))
             for j in range(i + 1, len(word2) + 1))

    common_subs = w1 & w2
    if not common_subs:
        return ''

    # Longest common substring, ties broken lexicographically.
    return max(common_subs, key=lambda s: (len(s), s))
Tallith answered 20/4, 2009 at 16:32 Comment(0)
3

Here's how it can be done using Charikar's simhash. This approach is also suitable for long documents; it will detect 100% similarity even when you change the order of words in the documents.

http://blog.simpliplant.eu/calculating-similarity-between-text-strings-in-python/
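In case the link goes stale, here is a rough sketch of the idea (word-level features and a 64-bit fingerprint; the tokenization and hash choice are my own assumptions, not taken from the linked post):

import hashlib

def simhash(text, bits=64):
    # Each word token votes on every bit of the fingerprint.
    v = [0] * bits
    for token in text.lower().split():
        h = int(hashlib.md5(token.encode()).hexdigest(), 16)
        for i in range(bits):
            v[i] += 1 if (h >> i) & 1 else -1
    return sum(1 << i for i in range(bits) if v[i] > 0)

def similarity(a, b, bits=64):
    # Fraction of fingerprint bits that agree.
    differing = bin(simhash(a, bits) ^ simhash(b, bits)).count('1')
    return 1.0 - differing / float(bits)

print(similarity('the quick brown fox', 'brown fox the quick'))  # 1.0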

Fusil answered 19/11, 2011 at 11:21 Comment(0)
2

Take a look at the Fuzzy module. It has fast algorithms, written in C, for soundex, NYSIIS, and double metaphone.

A good introduction can be found at: http://www.informit.com/articles/article.aspx?p=1848528
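A short sketch, assuming the interface from the Fuzzy module's own examples (Soundex takes the code length; DMetaphone returns a primary and an alternate code):

import fuzzy

soundex = fuzzy.Soundex(4)
print(soundex('fuzzy'))       # 'F200'
dmeta = fuzzy.DMetaphone()
print(dmeta('fuzzy'))         # [primary, alternate] codes
print(fuzzy.nysiis('fuzzy'))  # NYSIIS code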

Dorrie answered 3/4, 2012 at 12:12 Comment(0)
