Analysing and improving the performance of a query generated by Django's ORM (SORT)
Asked Answered
P

1

6

I have a not-so-complex (imho) filtering logic based on several conditions in my Django models. There is one particular query which takes an inusual amount of time to finish.

The query is built based on those two querysets:

queryset = self.serializer_class.Meta.model.valid_pricelist_objects.filter(
    Q(drug_prices__pricelist__price_destination__to_all_insurances=True
      ) |
    # pylint: disable=line-too-long
    Q(
        drug_prices__pricelist__price_destination__to_organization_data__organization__uuid
        =self.kwargs.get('i_uuid')))
return queryset

and

return super().get_queryset().filter(
    Q(active=True),
    Q(drug_prices__pricelist__active=True), # Lista de precios activa
    # Q(drug_pictures__is_main=True), # Que tenga una imagen
    # TODO: Hacer filtros por pais PriceListCountries
    Q(
        Q(drug_prices__pricelist__expires=False) | # Que tenga precios que no caducan o
        Q(
            Q(drug_prices__pricelist__expires=True), # Que tenga precios que si caducan Y
            Q(drug_prices__pricelist__datestart__date__lte=timezone.now()),  # Fecha de inicio menor que hoy Y
            Q(drug_prices__pricelist__dateend__date__gte=timezone.now())  # Fecha final mayor que hoy
        )
    )
).distinct().prefetch_related(
    'categories__contexts',
    'brands',
    'drug_prices__pricelist',
    'drug_pictures',
    'drug_prices__pricelist__price_destination',
)

The second querysets wraps the first one (via the super() call). The resulting query looks like this:

SELECT DISTINCT
    "phdrug_phdrug"."id",
    "phdrug_phdrug"."uuid",
    "phdrug_phdrug"."default_description",
    "phdrug_phdrug"."ean",
    "phdrug_phdrug"."parent_ean",
    "phdrug_phdrug"."reg_num",
    "phdrug_phdrug"."atc_iv",
    "phdrug_phdrug"."product_type",
    "phdrug_phdrug"."fraction",
    "phdrug_phdrug"."active",
    "phdrug_phdrug"."loyal",
    "phdrug_phdrug"."patent",
    "phdrug_phdrug"."chronics",
    "phdrug_phdrug"."recipe",
    "phdrug_phdrug"."deal",
    "phdrug_phdrug"."specialized",
    "phdrug_phdrug"."armored",
    "phdrug_phdrug"."hight_speciality",
    "phdrug_phdrug"."temp_8_15",
    "phdrug_phdrug"."temp_15_25",
    "phdrug_phdrug"."temp_2_8",
    "phdrug_phdrug"."temp_less_15",
    "phdrug_phdrug"."new",
    "phdrug_phdrug"."mdk_internal_code",
    "phdrug_phdrug"."mdk_single_id",
    "phdrug_phdrug"."is_from_mdk_db",
    "phdrug_phdrug"."top",
    "phdrug_phdrug"."laboratory_id",
    "phdrug_phdrug"."specialty_id" 
FROM
    "phdrug_phdrug"
    INNER JOIN "monetary_drugprice" ON ( "phdrug_phdrug"."id" = "monetary_drugprice"."drug_id" )
    INNER JOIN "monetary_pricelist" ON ( "monetary_drugprice"."pricelist_id" = "monetary_pricelist"."id" )
    INNER JOIN "monetary_drugprice" T4 ON ( "phdrug_phdrug"."id" = T4."drug_id" )
    INNER JOIN "monetary_pricelist" T5 ON ( T4."pricelist_id" = T5."id" )
    INNER JOIN "monetary_pricelistdestinations" ON ( T5."id" = "monetary_pricelistdestinations"."pricelist_id" )
    LEFT OUTER JOIN "organization_organizationdata" ON ( "monetary_pricelistdestinations"."to_organization_data_id" = "organization_organizationdata"."id" )
    LEFT OUTER JOIN "organization_organization" ON ( "organization_organizationdata"."organization_id" = "organization_organization"."id" ) 
WHERE
    (
        "phdrug_phdrug"."active" = TRUE 
        AND "monetary_pricelist"."active" = TRUE 
        AND (
            "monetary_pricelist"."expires" = FALSE 
            OR (
                "monetary_pricelist"."expires" = TRUE 
                AND ( "monetary_pricelist"."datestart" AT TIME ZONE'UTC' ) :: DATE <= '2019-01-22' 
                AND ( "monetary_pricelist"."dateend" AT TIME ZONE'UTC' ) :: DATE >= '2019-01-22' 
            )) 
        AND (
            "monetary_pricelistdestinations"."to_all_insurances" = TRUE 
            OR "organization_organization"."uuid" = 'b51773d4-05f8-43a2-86ef-0098b31725d8' 
        )) 
ORDER BY
    "phdrug_phdrug"."default_description" ASC

Running the query with EXPLAIN ANALYZE I get this:

Unique  (cost=10412.31..12666.32 rows=29084 width=143) (actual time=3373.496..3620.090 rows=6442 loops=1)
  ->  Sort  (cost=10412.31..10485.02 rows=29084 width=143) (actual time=3373.494..3460.790 rows=228667 loops=1)
        Sort Key: phdrug_phdrug.default_description, phdrug_phdrug.id, phdrug_phdrug.uuid, phdrug_phdrug.ean, phdrug_phdrug.parent_ean, phdrug_phdrug.reg_num, phdrug_phdrug.medika_code, phdrug_phdrug.atc_iv, phdrug_phdrug.product_type, phdrug_phdrug.fraction, phdrug_phdrug.active, phdrug_phdrug.loyal, phdrug_phdrug.patent, phdrug_phdrug.chronics, phdrug_phdrug.recipe, phdrug_phdrug.deal, phdrug_phdrug.specialized, phdrug_phdrug.armored, phdrug_phdrug.hight_speciality, phdrug_phdrug.temp_8_15, phdrug_phdrug.temp_15_25, phdrug_phdrug.temp_2_8, phdrug_phdrug.temp_less_15, phdrug_phdrug.new, phdrug_phdrug.mdk_internal_code, phdrug_phdrug.mdk_single_id, phdrug_phdrug.is_from_mdk_db, phdrug_phdrug.top, phdrug_phdrug.laboratory_id, phdrug_phdrug.specialty_id
        Sort Method: external merge  Disk: 31192kB
        ->  Hash Join  (cost=704.51..6166.54 rows=29084 width=143) (actual time=23.648..507.099 rows=228667 loops=1)
              Hash Cond: (monetary_drugprice.pricelist_id = monetary_pricelist.id)
              ->  Nested Loop  (cost=696.92..5604.95 rows=44105 width=147) (actual time=22.881..416.630 rows=457692 loops=1)
                    Join Filter: (phdrug_phdrug.id = monetary_drugprice.drug_id)
                    ->  Hash Join  (cost=696.51..1177.21 rows=4583 width=147) (actual time=22.864..38.841 rows=23577 loops=1)
                          Hash Cond: (phdrug_phdrug.id = t4.drug_id)
                          ->  Seq Scan on phdrug_phdrug  (cost=0.00..359.94 rows=11992 width=143) (actual time=0.438..3.593 rows=11992 loops=1)
                                Filter: active
                                Rows Removed by Filter: 2
                          ->  Hash  (cost=639.21..639.21 rows=4584 width=4) (actual time=22.339..22.339 rows=23577 loops=1)
                                Buckets: 32768 (originally 8192)  Batches: 1 (originally 1)  Memory Usage: 1085kB
                                ->  Nested Loop  (cost=3.99..639.21 rows=4584 width=4) (actual time=1.785..16.702 rows=23577 loops=1)
                                      ->  Nested Loop  (cost=3.58..9.11 rows=5 width=8) (actual time=1.756..1.874 rows=7 loops=1)
                                            ->  Hash Left Join  (cost=3.43..7.57 rows=5 width=4) (actual time=1.733..1.797 rows=7 loops=1)
                                                  Hash Cond: (monetary_pricelistdestinations.to_organization_data_id = organization_organizationdata.id)
                                                  Filter: (monetary_pricelistdestinations.to_all_insurances OR (organization_organization.uuid = 'b51773d4-05f8-43a2-86ef-0098b31725d8'::uuid))
                                                  Rows Removed by Filter: 130
                                                  ->  Seq Scan on monetary_pricelistdestinations  (cost=0.00..3.37 rows=137 width=9) (actual time=0.626..0.643 rows=137 loops=1)
                                                  ->  Hash  (cost=3.12..3.12 rows=25 width=20) (actual time=1.076..1.076 rows=25 loops=1)
                                                        Buckets: 1024  Batches: 1  Memory Usage: 10kB
                                                        ->  Hash Left Join  (cost=1.56..3.12 rows=25 width=20) (actual time=1.040..1.053 rows=25 loops=1)
                                                              Hash Cond: (organization_organizationdata.organization_id = organization_organization.id)
                                                              ->  Seq Scan on organization_organizationdata  (cost=0.00..1.25 rows=25 width=8) (actual time=0.501..0.504 rows=25 loops=1)
                                                              ->  Hash  (cost=1.25..1.25 rows=25 width=20) (actual time=0.513..0.513 rows=25 loops=1)
                                                                    Buckets: 1024  Batches: 1  Memory Usage: 10kB
                                                                    ->  Seq Scan on organization_organization  (cost=0.00..1.25 rows=25 width=20) (actual time=0.484..0.501 rows=25 loops=1)
                                            ->  Index Only Scan using monetary_pricelist_pkey on monetary_pricelist t5  (cost=0.14..0.31 rows=1 width=4) (actual time=0.007..0.007 rows=1 loops=7)
                                                  Index Cond: (id = monetary_pricelistdestinations.pricelist_id)
                                                  Heap Fetches: 7
                                      ->  Index Scan using monetary_drugprice_pricelist_id_1ce160ce on monetary_drugprice t4  (cost=0.42..110.21 rows=1581 width=8) (actual time=0.010..1.236 rows=3368 loops=7)
                                            Index Cond: (pricelist_id = t5.id)
                    ->  Index Scan using monetary_drugprice_drug_id_c2f278e5 on monetary_drugprice  (cost=0.42..0.78 rows=15 width=8) (actual time=0.002..0.009 rows=19 loops=23577)
                          Index Cond: (drug_id = t4.drug_id)
              ->  Hash  (cost=6.45..6.45 rows=91 width=4) (actual time=0.757..0.757 rows=93 loops=1)
                    Buckets: 1024  Batches: 1  Memory Usage: 12kB
                    ->  Seq Scan on monetary_pricelist  (cost=0.00..6.45 rows=91 width=4) (actual time=0.655..0.713 rows=93 loops=1)
                          Filter: (active AND ((NOT expires) OR (expires AND ((timezone('UTC'::text, datestart))::date <= '2019-01-22'::date) AND ((timezone('UTC'::text, dateend))::date >= '2019-01-22'::date))))
                          Rows Removed by Filter: 45
Planning time: 25.871 ms
Execution time: 3638.544 ms

If I paste this in explain.depesz.com, I get this result

query explain

Almost the entire time that the query takes to finish is spent sorting, as you can see here:

enter image description here

...and this is where I'm completely lost. What is it sorting? It's not the ORDER BY at the end of my query, I already tried removing that.

How can I improve the performance of this query?

EDIT: I'm adding the EXPLAIN ANALYZE after removing the ORDER BY. The query takes longer to finish, consistently. Which is adding another layer of wtf to the problem.

Unique  (cost=10412.31..12666.32 rows=29084 width=143) (actual time=4229.508..4427.560 rows=6442 loops=1)
  ->  Sort  (cost=10412.31..10485.02 rows=29084 width=143) (actual time=4229.506..4274.698 rows=228667 loops=1)
        Sort Key: phdrug_phdrug.id, phdrug_phdrug.uuid, phdrug_phdrug.default_description, phdrug_phdrug.ean, phdrug_phdrug.parent_ean, phdrug_phdrug.reg_num, phdrug_phdrug.medika_code, phdrug_phdrug.atc_iv, phdrug_phdrug.product_type, phdrug_phdrug.fraction, phdrug_phdrug.active, phdrug_phdrug.loyal, phdrug_phdrug.patent, phdrug_phdrug.chronics, phdrug_phdrug.recipe, phdrug_phdrug.deal, phdrug_phdrug.specialized, phdrug_phdrug.armored, phdrug_phdrug.hight_speciality, phdrug_phdrug.temp_8_15, phdrug_phdrug.temp_15_25, phdrug_phdrug.temp_2_8, phdrug_phdrug.temp_less_15, phdrug_phdrug.new, phdrug_phdrug.mdk_internal_code, phdrug_phdrug.mdk_single_id, phdrug_phdrug.is_from_mdk_db, phdrug_phdrug.top, phdrug_phdrug.laboratory_id, phdrug_phdrug.specialty_id
        Sort Method: external merge  Disk: 31160kB
        ->  Hash Join  (cost=704.51..6166.54 rows=29084 width=143) (actual time=21.814..605.830 rows=228667 loops=1)
              Hash Cond: (monetary_drugprice.pricelist_id = monetary_pricelist.id)
              ->  Nested Loop  (cost=696.92..5604.95 rows=44105 width=147) (actual time=21.195..501.337 rows=457692 loops=1)
                    Join Filter: (phdrug_phdrug.id = monetary_drugprice.drug_id)
                    ->  Hash Join  (cost=696.51..1177.21 rows=4583 width=147) (actual time=21.180..43.205 rows=23577 loops=1)
                          Hash Cond: (phdrug_phdrug.id = t4.drug_id)
                          ->  Seq Scan on phdrug_phdrug  (cost=0.00..359.94 rows=11992 width=143) (actual time=0.491..5.225 rows=11992 loops=1)
                                Filter: active
                                Rows Removed by Filter: 2
                          ->  Hash  (cost=639.21..639.21 rows=4584 width=4) (actual time=20.589..20.589 rows=23577 loops=1)
                                Buckets: 32768 (originally 8192)  Batches: 1 (originally 1)  Memory Usage: 1085kB
                                ->  Nested Loop  (cost=3.99..639.21 rows=4584 width=4) (actual time=1.252..15.098 rows=23577 loops=1)
                                      ->  Nested Loop  (cost=3.58..9.11 rows=5 width=8) (actual time=1.182..1.292 rows=7 loops=1)
                                            ->  Hash Left Join  (cost=3.43..7.57 rows=5 width=4) (actual time=1.164..1.226 rows=7 loops=1)
                                                  Hash Cond: (monetary_pricelistdestinations.to_organization_data_id = organization_organizationdata.id)
                                                  Filter: (monetary_pricelistdestinations.to_all_insurances OR (organization_organization.uuid = 'b51773d4-05f8-43a2-86ef-0098b31725d8'::uuid))
                                                  Rows Removed by Filter: 130
                                                  ->  Seq Scan on monetary_pricelistdestinations  (cost=0.00..3.37 rows=137 width=9) (actual time=0.347..0.364 rows=137 loops=1)
                                                  ->  Hash  (cost=3.12..3.12 rows=25 width=20) (actual time=0.794..0.794 rows=25 loops=1)
                                                        Buckets: 1024  Batches: 1  Memory Usage: 10kB
                                                        ->  Hash Left Join  (cost=1.56..3.12 rows=25 width=20) (actual time=0.774..0.786 rows=25 loops=1)
                                                              Hash Cond: (organization_organizationdata.organization_id = organization_organization.id)
                                                              ->  Seq Scan on organization_organizationdata  (cost=0.00..1.25 rows=25 width=8) (actual time=0.317..0.319 rows=25 loops=1)
                                                              ->  Hash  (cost=1.25..1.25 rows=25 width=20) (actual time=0.432..0.432 rows=25 loops=1)
                                                                    Buckets: 1024  Batches: 1  Memory Usage: 10kB
                                                                    ->  Seq Scan on organization_organization  (cost=0.00..1.25 rows=25 width=20) (actual time=0.411..0.415 rows=25 loops=1)
                                            ->  Index Only Scan using monetary_pricelist_pkey on monetary_pricelist t5  (cost=0.14..0.31 rows=1 width=4) (actual time=0.006..0.006 rows=1 loops=7)
                                                  Index Cond: (id = monetary_pricelistdestinations.pricelist_id)
                                                  Heap Fetches: 7
                                      ->  Index Scan using monetary_drugprice_pricelist_id_1ce160ce on monetary_drugprice t4  (cost=0.42..110.21 rows=1581 width=8) (actual time=0.012..1.127 rows=3368 loops=7)
                                            Index Cond: (pricelist_id = t5.id)
                    ->  Index Scan using monetary_drugprice_drug_id_c2f278e5 on monetary_drugprice  (cost=0.42..0.78 rows=15 width=8) (actual time=0.002..0.012 rows=19 loops=23577)
                          Index Cond: (drug_id = t4.drug_id)
              ->  Hash  (cost=6.45..6.45 rows=91 width=4) (actual time=0.609..0.609 rows=93 loops=1)
                    Buckets: 1024  Batches: 1  Memory Usage: 12kB
                    ->  Seq Scan on monetary_pricelist  (cost=0.00..6.45 rows=91 width=4) (actual time=0.539..0.582 rows=93 loops=1)
                          Filter: (active AND ((NOT expires) OR (expires AND ((timezone('UTC'::text, datestart))::date <= '2019-01-22'::date) AND ((timezone('UTC'::text, dateend))::date >= '2019-01-22'::date))))
                          Rows Removed by Filter: 45
Planning time: 25.288 ms
Execution time: 4440.406 ms
Parhelion answered 23/1, 2019 at 8:19 Comment(2)
And how EXPLAIN ANALYZE output of the query without ORDER BY looks like? You said that you remove it from the query, can you show that?Selfgoverned
@KamilNiski Of course, I added it.Parhelion
S
1

The reason you get sorting in you EXPLAIN ANALYZE result is because in your query you are using SELECT DISTINCT so the db is sorting by all columns the rows and removes duplicates.

Review your query maybe you do not need to do distinct on all columns in the sql expression.

Apply DISTINCT to columns you need

Try replacing .distinct() with .distinct('pk')

Refer to Django distinct() docs

On PostgreSQL only, you can pass positional arguments (*fields) in order to specify the names of fields to which the DISTINCT should apply. This translates to a SELECT DISTINCT ON SQL query. Here’s the difference. For a normal distinct() call, the database compares each field in each row when determining which rows are distinct. For a distinct() call with specified field names, the database will only compare the specified field names.

Try changing sorting method

Your query plan indicates that sorting of rows takes place on disk instead in memory.

Sort Method: external merge  Disk: 31160kB

You can try solutions from another answer

Use LIMIT

Additionally, do you really need all of the data in one batch. Have you tried adding pagination?

Use GROUP BY

I've seen people try to convert their queries from DISTINCT to GROUP BY but i'm not sure if it is possible in your case.

Selfgoverned answered 23/1, 2019 at 10:58 Comment(4)
This is definitely improving the time the query takes to finish, but it's still far away from reasonably fast. Now the query looks like this and the EXPLAIN looks like this. It's still SORTing for more than a second. What else could be causing this SORT?Parhelion
Now sort is coming from ORDER BY "phdrug_phdrug"."id" ASC, "phdrug_phdrug"."default_description" ASC in your query, do really need to sort by default_description? Have you tried without it?Selfgoverned
Yes, I tried removing the entire ORDER BY statement. The SORT doesn't seem to be caused by the ORDER. Here is the query without the ORDER BY, and here is the EXPLAIN for that query.Parhelion
@Parhelion I've updated the answer, maybe these suggestions will help.Selfgoverned

© 2022 - 2024 — McMap. All rights reserved.