Django max similarity (TrigramSimilarity) from ManyToManyField

Asked 3/2, 2018 at 23:34 Answered 9/2, 2018 at 15:10

Solved python django postgresql django-queryset trigram

I have to implement a search function which will be fault tolerant.
Currently, I have the following situation:

Models:

class Tag(models.Model):
    name = models.CharField(max_length=255)

class Illustration(models.Model):
    name = models.CharField(max_length=255)
    tags = models.ManyToManyField(Tag)

Query:

queryset.annotate(similarity=TrigramSimilarity('name', fulltext) + TrigramSimilarity('tags__name', fulltext))

Example data:

Illustrations:

ID |  Name  |        Tags       |
---|--------|-------------------|
 1 | "Dog"  | "Animal", "Brown" |
 2 | "Cat"  | "Animals"         |

Illustration has Tags:

ID_Illustration | ID_Tag |
----------------|--------|
       1        |    1   |
       1        |    2   |
       2        |    3   |

Tags:

ID_Tag |   Name   |
-------|----------|
   1   |  Animal  |
   2   |  Brown   |
   3   |  Animals |

When I run the query with "Animal", the similarity for "Dog" should be higher than for "Cat", as it is a perfect match.
Unfortunately, both tags are considered together somehow.
Currently, it looks like it's concatenating the tags in a single string and then checks for similarity:

TrigramSimilarity("Animal Brown", "Animal") => X

But I would like to adjust it in a way that I will get the highest similarity between an Illustration instance name and its tags:

Max([
    TrigramSimilarity('Name', "Animal"), 
    TrigramSimilarity("Tag_1", "Animal"), 
    TrigramSimilarity("Tag_2", "Animal"),
]) => X

Edit1: I'm trying to query all Illustration, where either the title or one of the tags has a similarity bigger than X.

Edit2: Additional example:

fulltext = 'Animal'

TrigramSimilarity('Animal Brown', fulltext) => x TrigramSimilarity('Animals', fulltext) => y

Where x < y

But what I want is actually

TrigramSimilarity(Max(['Animal', 'Brown]), fulltext) => x (Similarity to Animal) TrigramSimilarity('Animals', fulltext) => y

Where x > y

Engadine answered 3/2, 2018 at 23:34 Comment(8)

Can you precise from which model you are doing the queryset? Also, what are those example data ? the name+id of the illustration and the tags ? – Irreverent 6/2, 2018 at 11:31

Question adjusted – Engadine 6/2, 2018 at 13:26

Edit 2 added (for more details). – Engadine 6/2, 2018 at 18:0

is TrigramSimilarity a python function or does it translate to SQL? If it's python, I don't think you can use it in the annotate call as you're trying to do. – Leta 7/2, 2018 at 6:27

TrigramSimilarity is part of the Django Framework ( docs.djangoproject.com/en/2.0/ref/contrib/postgres/search), but it requires a PostgreSQL Database (won't work with SQLite) with the pg_trgm extension activated. – Engadine 8/2, 2018 at 10:29

@Engadine can I ask if you've tried my solution ? – Nathanialnathaniel 12/2, 2018 at 13:59

@Paolo: I didn't had the time so far, but I will try it in the evening and give you feedback. Sorry, but I already started applying Johns approach, before you posted your solution. – Engadine 12/2, 2018 at 14:36

@Engadine the next time you choose an answer to your question, accept it immediately, you will avoid people taking the time to write a solution that will not be taken into consideration by you. Anyway, I'm waiting your feedback about my solution. – Nathanialnathaniel 12/2, 2018 at 14:46

You cannot break up the tags__name (at least I don't know a way).
From your examples, I can assume 2 possible solutions (1st solution is not strictly using Django):

Not everything needs to pass strictly through Django
We have Python powers, so let's use them:

Let us compose the query first:
```
from difflib import SequenceMatcher

from django.db.models import Q

def create_query(fulltext):
    illustration_names = Illustration.objects.values_list('name', flat=True)
    tag_names = Tag.objects.values_list('name', flat=True)
    query = []

    for name in illustration_names:
        score = SequenceMatcher(None, name, fulltext).ratio()
        if score == 1:
            # Perfect Match for name
            return [Q(name=name)]

         if score >= THRESHOLD:
            query.append(Q(name=name))

    for name in tag_names:
        score = SequenceMatcher(None, name, fulltext).ratio()
        if score == 1:
            # Perfect Match for name
            return [Q(tags__name=name)]

         if score >= THRESHOLD:
            query.append(Q(tags__name=name))

    return query
```
Then to create your queryset:
```
from functools import reduce # Needed only in python 3
from operator import or_

queryset = Illustration.objects.filter(reduce(or_, create_query(fulltext)))
```
Decode the above:

We are checking every Illustration and Tag name against our fulltext and we are composing a query with every name that it's similarity passes the THRESHOLD.
- SequenceMatcher method compares sequences and returns a ratio 0 < ratio < 1 where 0 indicates No-Match and 1 indicates Perfect-Match. Check this answer for another usage example: Find the similarity percent between two strings (Note: There are other strings comparing modules as well, find one that suits you)
- Q() Django objects, allow the creation of complex queries (more on the linked docs).
- With the operator and reduce we transform a list of Q() objects to an OR separated query argument:
  Q(name=name_1) | Q(name=name_2) | ... | Q(tag_name=tag_name_1) | ...
Note: You need to define an acceptable THRESHOLD.
As you can imagine this will be a bit slow but it is to be expected when you need to do a "fuzzy" search.

(The Django Way:)
Use a query with a high similarity threshold and order the queryset by this similarity rate:
```
queryset.annotate(
    similarity=Greatest(
        TrigramSimilarity('name', fulltext), 
        TrigramSimilarity('tags__name', fulltext)
    )).filter(similarity__gte=threshold).order_by('-similarity')
```
Decode the above:
- Greatest() accepts an aggregation (not to be confused with the Django method aggregate) of expressions or of model fields and returns the max item.
- TrigramSimilarity(word, search) returns a rate between 0 and 1. The closer the rate is to 1, the more similar the word is to search.
- .filter(similarity__gte=threshold), will filter similarities lower than the threshold.
- 0 < threshold < 1. You can set the threshold to 0.6 which is pretty high (consider that the default is 0.3). You can play around with that to tune your performance.
- Finally, order the queryset by the similarity rate in a descending order.

Barfuss answered 8/2, 2018 at 14:3 Comment(5)

Unfortunately it is not working like this, as TrigramSimilarity only works like this: TrigramSimilarity(columnName, searchString). Therefore it would search for a column labeled as "Animal", "Brown" or "Animals". – Engadine 9/2, 2018 at 9:15

Thanks a lot, I will try it out. But it looks different to what I had before, as it only allows me to filter, but not to sort depending on the result or am I wrong? – Engadine 9/2, 2018 at 12:28

@Engadine you cannot sort as it is but with a slight modification you could do it (sort the name lists before the composition of the query). – Barfuss 9/2, 2018 at 13:16

I was able to resolve the issue as you described in option 1. Thanks a lot for your help! – Engadine 12/2, 2018 at 13:30

@Engadine Good to know :) – Barfuss 12/2, 2018 at 16:56

I solved it using only TrigramSimilarity, Max and Greatest.

I populated some data as in your question:

from illustrations.models import Illustration, Tag
Tag.objects.bulk_create([Tag(name=t) for t in ['Animal', 'Brown', 'Animals']])
Illustration.objects.bulk_create([Illustration(name=t) for t in ['Dog', 'Cat']])
dog=Illustration.objects.get(name='Dog')
cat=Illustration.objects.get(name='Cat')
animal=Tag.objects.get(name='Animal')
brown=Tag.objects.get(name='Brown')
animals=Tag.objects.get(name='Animals')
dog.tags.add(animal, brown)
cat.tags.add(animals)

I imported all necessary functions and initialized fulltext:

from illustrations.models import Illustration
from django.contrib.postgres.search import TrigramSimilarity
from django.db.models.functions import Greatest
from django.db.models import Max
fulltext = 'Animal'

Then I executed the query:

Illustration.objects.annotate(
    max_similarity=Greatest(
        Max(TrigramSimilarity('tags__name', fulltext)),
        TrigramSimilarity('name', fulltext)
    )
).values('name', 'max_similarity')

With this results:

<QuerySet [{'name': 'Dog', 'max_similarity': 1.0}, {'name': 'Cat', 'max_similarity': 0.666667}]>

This is the SQL query exceuted from PostgreSQL:

SELECT "illustrations_illustration"."name", GREATEST(MAX(SIMILARITY("illustrations_tag"."name", 'Animal')), SIMILARITY("illustrations_illustration"."name", 'Animal')) AS "max_similarity"
FROM "illustrations_illustration"
LEFT OUTER JOIN "illustrations_illustration_tags" ON ("illustrations_illustration"."id" = "illustrations_illustration_tags"."illustration_id")
LEFT OUTER JOIN "illustrations_tag" ON ("illustrations_illustration_tags"."tag_id" = "illustrations_tag"."id")
GROUP BY "illustrations_illustration"."id", SIMILARITY("illustrations_illustration"."name", 'Animal')

You can use the max_similarity annotation to filter or order your results.

Nathanialnathaniel answered 9/2, 2018 at 15:10 Comment(0)

Hot tags

Godot Unity Godot Help Programming Godot 4.X GUI GDScript 3D 2D Physics CSharp Godot 3.X VR XR Projects C++

Recommended topics

Hot tags