Is there a way to filter a django queryset based on string similarity (a la python difflib)?

S

4

13

I have a need to match cold leads against a database of our clients.

The leads come from a third party provider in bulk (thousands of records) and sales is asking us to (in their words) "filter out our clients" so they don't try to sell our service to a established client.

Obviously, there are misspellings in the leads. Charles becomes Charlie, Joseph becomes Joe, etc. So I can't really just do a filter comparing lead_first_name to client_first_name, etc.

I need to use some sort of string similarity mechanism.

Right now I'm using the lovely difflib to compare the leads' first and last names to a list generated with Client.objects.all(). It works, but because of the number of clients it tends to be slow.

I know that most sql databases have soundex and difference functions. See my test of it in the update below - it doesn't work as well as difflib.

Is there another solution? Is there a better solution?

Edit:

Soundex, at least in my db, doesn't behave as well as difflib.

Here is a simple test - look for "Joe Lopes" in a table containing "Joseph Lopes":

with temp (first_name, last_name) as (
select 'Joseph', 'Lopes'
union
select 'Joe', 'Satriani'
union
select 'CZ', 'Lopes'
union
select 'Blah', 'Lopes'
union
select 'Antonio', 'Lopes'
union
select 'Carlos', 'Lopes'
)
select first_name, last_name
  from temp
 where difference(first_name+' '+last_name, 'Joe Lopes') >= 3
 order by difference(first_name+' '+last_name, 'Joe Lopes')

The above returns "Joe Satriani" as the only match. Even reducing the similarity threshold to 2 doesn't return "Joseph Lopes" as a potential match.

But difflib does a much better job:

difflib.get_close_matches('Joe Lopes', ['Joseph Lopes', 'Joe Satriani', 'CZ Lopes', 'Blah Lopes', 'Antonio Lopes', 'Carlos Lopes'])
['Joseph Lopes', 'CZ Lopes', 'Carlos Lopes']

Edit after gruszczy's response:

Before writing my own, I looked for and found a T-SQL implementation of Levenshtein Distance in the repository of all knowledge.

In testing it, it still won't do a better matching job than difflib.

Which led me to research what algorithm is behind difflib. It seems to be a modified version of the Ratcliff-Obershelp algorithm.

Unhappily I can't seem to find some other kind soul who has already created a T-SQL implementation based on difflib's... I'll try my hand at it when I can.

If nobody else comes up with a better answer in the next few days, I'll grant it to gruszczy. Thanks, kind sir.

Snappish answered 29/7, 2010 at 19:57 Comment(0)

O

2

soundex won't help you, because it's a phonetic algorithm. Joe and Joseph aren't similar phonetically, so soundex won't mark them as similar.

You can try Levenshtein distance, which is implemented in PostgreSQL. Maybe in your database too and if not, you should be able to write a stored procedure, which will calculate the distance between two strings and use it in your computation.

Oleviaolfaction answered 4/8, 2010 at 14:16 Comment(3)

I faced a similar situation in one of my projects. We generated and stored custom soundex values instead of relying on the built in soundex. This somewhat improved the matches. But as gruszczy says this cannot be solved by soundex alone. – Giese 5/8, 2010 at 7:24

A db function is a good idea. But Levenshtein is not as good as difflib - see my update above. – Snappish 5/8, 2010 at 15:11

Can't you just set greater distance for levenshtein function? The greater it is, the more results it will produce. – Oleviaolfaction 5/8, 2010 at 15:57

M

3

As per the answer of andilabs you can use the Levenshtein function to create your custom function. Postgres doc indicates that the Levenshtein function is as follows:

levenshtein(text source, text target, int ins_cost, int del_cost, int sub_cost) returns int levenshtein(text source, text target) returns int

andilabs answer can use the only second function. If you want a more advanced search with insertion/deletion/substitution costs, you can rewrite function like this:

from django.db.models import Func

class Levenshtein(Func):
    template = "%(function)s(%(expressions)s, '%(search_term)s', %(ins_cost)d, %(del_cost)d, %(sub_cost)d)"
    function = 'levenshtein'

    def __init__(self, expression, search_term, ins_cost=1, del_cost=1, sub_cost=1, **extras):
        super(Levenshtein, self).__init__(
            expression,
            search_term=search_term,
            ins_cost=ins_cost,
            del_cost=del_cost,
            sub_cost=sub_cost,
            **extras
        )

And call the function:

from django.db.models import F

Spot.objects.annotate(
    lev_dist=Levenshtein(F('name'), 'Kfaka', 3, 3, 1)  # ins = 3, del = 3, sub = 1
).filter(
    lev_dist__lte=2
)

Moisesmoishe answered 2/4, 2021 at 9:29 Comment(0)

O

2

soundex won't help you, because it's a phonetic algorithm. Joe and Joseph aren't similar phonetically, so soundex won't mark them as similar.

You can try Levenshtein distance, which is implemented in PostgreSQL. Maybe in your database too and if not, you should be able to write a stored procedure, which will calculate the distance between two strings and use it in your computation.

Oleviaolfaction answered 4/8, 2010 at 14:16 Comment(3)

I faced a similar situation in one of my projects. We generated and stored custom soundex values instead of relying on the built in soundex. This somewhat improved the matches. But as gruszczy says this cannot be solved by soundex alone. – Giese 5/8, 2010 at 7:24

A db function is a good idea. But Levenshtein is not as good as difflib - see my update above. – Snappish 5/8, 2010 at 15:11

Can't you just set greater distance for levenshtein function? The greater it is, the more results it will produce. – Oleviaolfaction 5/8, 2010 at 15:57

E

2

It's possible with trigram_similar lookups since Django 1.10, see docs for PostgreSQL specific lookups and Full text search

Eldredge answered 25/7, 2016 at 15:22 Comment(0)

U

2

If you need getting there with django and postgres and don't want to use introduced in 1.10 trigram-similarity https://docs.djangoproject.com/en/2.0/ref/contrib/postgres/lookups/#trigram-similarity you can implement using Levensthein like these:

Extension needed fuzzystrmatch

you need adding postgres extension to your db in psql:

CREATE EXTENSION fuzzystrmatch;

Lets define custom function with wich we can annotate queryset. It just take one argument the search_term and uses postgres levenshtein function (see docs):

from django.db.models import Func

class Levenshtein(Func):
    template = "%(function)s(%(expressions)s, '%(search_term)s')"
    function = "levenshtein"

    def __init__(self, expression, search_term, **extras):
        super(Levenshtein, self).__init__(
            expression,
            search_term=search_term,
            **extras
        )

then in any other place in project we just import defined Levenshtein and F to pass the django field.

from django.db.models import F

Spot.objects.annotate(
    lev_dist=Levenshtein(F('name'), 'Kfaka')
).filter(
    lev_dist__lte=2
)

Underage answered 6/4, 2018 at 15:27 Comment(1)

What if I want to add costs to the insertion, deletion, and substitution? How can I use this class then? – Moisesmoishe 1/4, 2021 at 12:53

Recommended topics

Hot tags