Improve Django queryset performance when using annotate Exists
Asked Answered
H

2

7

I have a queryset that returns a lot of data, it can be filtered by year which will return around 100k lines, or show all which will bring around 1 million lines.

The objective of this annotate is to generate a xlsx spreadsheet.

Models representation, RelatedModel is manytomany between Model and AnotherModel

Model:
    id
    field1
    field2
    field3

RelatedModel:
    foreign_key_model (Model)
    foreign_key_another (AnotherModel)

Queryset, if the relation exists it will annotate, this annotate is very slow and can take several minutes.

Model.objects.all().annotate(
    related_exists=Exists(RelatedModel.objects.filter(foreign_key_model=OuterRef('id'))),
    related_column=Case(
        When(related_exists=True, then=Value('The relation exists!')),
        When(related_exists=False, then=Value('The relation doesn't exist!')),
        default=Value('This is the default value!'),
        output_field=CharField(),
    )
).values_list(
    'related_column',
    'field1',
    'field2',
    'field3'
)
Highbred answered 27/11, 2019 at 22:14 Comment(3)
Use pagination?Precatory
The objective is to generate a spreadsheet (xlsx file)Highbred
Yes, you can write in batches.Precatory
P
8

If only thing needed is to change how True / False is displayed in xlsx - one option is to just have one related_exists BooleanField annotation and later customize how it will be converted when creating xlsx document - i.e. in serializer. Database should store raw / unformatted values, and app prepare them to be shown to user.

Other things to consider:

  • Indexes to speed-up filtering.
  • If you have millions of records after filtering, in one table - maybe table partitioning could be considered.

But let's look into raw sql of original query. It will be like this:

SELECT [model_fields],
       EXISTS([CLIENT_SELECT]) AS related_exists,
       CASE
       WHEN EXISTS([CLIENT_SELECT]) = true THEN 'The relation exists!'
       WHEN EXISTS([CLIENT_SELECT]) = true THEN 'The relation does not exist!'
       ELSE 'The relation exists!'
       END AS related_column
FROM model;

And right away we can see nested query for Exists CLIENT_SELECT is there 3 times. Even though it is exactly the same, it may be executed minimum 2 times and up to 3 times. Database may optimize it to be faster than 3x, but it still is not optimal as 1x.

First, EXISTS returns either True or False, we can leave just one check that it is True, making 'The relation does not exist!' the default value.

    related_column=Case(
        When(related_exists=True, then=Value('The relation exists!')),
        default=Value('The relation does not exist!')

Why related_column performs same select again and not takes the value of related_exists?

Because we cannot reference calculated columns while calculating another columns - and this is database level constraint django knows about and duplicates expression.

Wait, then we actually do not need related_exists column, lets just leave related_column with CASE statement and 1 exists subquery.

Here comes Django - we cannot (till 3.0) use expressions in filters without annotating them first.

So, it our case it is like: in order to use Exist in When, we first need to add it as annotation, but it won't be used as a reference, but a full copy of expression.


Good news!

Since Django 3.0 we can use expressions that output BooleanField directly in QuerySet filters, without having to first annotate. Exists is one of such BooleaField expressions.

Model.objects.all().annotate(
    related_column=Case(
        When(
            Exists(RelatedModel.objects.filter(foreign_key_model=OuterRef('id'))),
            then=Value('The relation exists!'),
        ),
        default=Value('The relation doesn't exist!'),
        output_field=CharField(),
    )
)

And only one nested select, and one annotated field.


Django 2.1, 2.2

Here's the commit that finalized allowance of boolean expressions although many pre-conditions for it were added earlier. One of them is presence of conditional attribute on expression object and check for this attribute.

So, although not recommended and not tested it seems quite working little hack for Django 2.1, 2.2 (before there was no conditional check, and it will require more intrusive changes):

  • create Exists expression instance
  • monkey patch it with conditional = True
  • use it as condition in When statement
related_model_exists = Exists(RelatedModel.objects.filter(foreign_key_model=OuterRef('id')))

setattr(related_model_exists, 'conditional', True)

Model.objects.all().annotate(
    related_column=Case(
        When(
            relate_model_exists,
            then=Value('The relation exists!'),
        ),
        default=Value('The relation doesn't exist!'),
        output_field=CharField(),
    )
)


Related checks

relatedmodel_set__isnull=True check is not suitable for several reasons:

  • it performs LEFT OUTER JOIN - that is less efficient than EXISTS
  • it performs LEFT OUTER JOIN - it joins tables, this makes it ONLY suitable in filter() condition (not in annotate - When), and only for OneToOne or OneToMany (One is on relatedmodel side) relations
Potvaliant answered 14/12, 2019 at 10:52 Comment(1)
setattr(related_model_exists, 'conditional', True) not works for dj 2 =(Chryso
F
2

You can considerably simplify your query to:

from django.db.models import Count
Model.objects.all().annotate(
    related_column=Case(
        When(relatedmodel_set__isnull=True, then=Value("The relation doesn't exist!")), 
        default=Value("The relation exists!"), 
        output_field=CharField()
    )
)

Where relatedmodel_set is the related_name on your foreign key.

Familist answered 14/12, 2019 at 5:43 Comment(0)

© 2022 - 2024 — McMap. All rights reserved.