Average of top 5 value in a model
Asked Answered
V

2

8

I've got a django model with a lot of fields. I'm trying in a single query to get the avg value of a given field and the average value of the top 5 values of that same field (from my other question regarding pure SQL: Average of top 5 value in a table for a given group by). Not that it should matters but: my database is redshift.

I've found two different ways to achieve this in SQL, but I'm having trouble implementing those queries using django ORM

Here is an example of what I want to do using Cars:

class Cars(models.Model):
    manufacturer = models.CharField()
    model = models.CharField()
    price = models.FloatField()

Data:

manufacturer | model | price
Citroen        C1      1
Citroen        C2      2
Citroen        C3      3
Citroen        C4      4
Citroen        C5      5
Citroen        C6      6
Ford           F1      7
Ford           F2      8
Ford           F3      9
Ford           F4      10
Ford           F5      11
Ford           F6      12 
Ford           F6      19 
GenMotor       G1      20
GenMotor       G3      25
GenMotor       G4      22

Expected Output:

manufacturer | average_price | average_top_5_price
Citroen        3.5             4.0
Ford           10.85           12.2
GenMotor       22.33           22.33

Here are two pure SQL queries achieving the desired effect:

SELECT
    main.manufacturer,
    AVG(main.price) AS average_price,
    AVG(CASE WHEN rank <= 5 THEN main.price END) AS average_top_5_price
FROM (
    SELECT
        manufacturer,
        price,
        ROW_NUMBER() OVER (PARTITION BY manufacturer ORDER BY price DESC) AS rank
    FROM
        cars
) main
GROUP BY
    main.manufacturer;

And the second way of doing it:

SELECT A.manufacturer, A.avg_price, B.top5_price
FROM (
    SELECT manufacturer, AVG(price) as avg_price
    FROM cars
    GROUP BY manufacturer
) A
JOIN (
    SELECT manufacturer, AVG(psv_99) as top5_price
    FROM (
        SELECT manufacturer, price, RANK()
        OVER (PARTITION BY manufacturer ORDER BY price DESC, id)
        FROM cars
    )
    WHERE rank <= 5
    GROUP BY manufacturer
) B
ON A.manufacturer = B.manufacturer
ORDER BY manufacturer

So far I haven't manage to implement either of those queries using django ORM, for the first one I can't find a way to have django do a "select from subquery" for the second one I can't find a good way to force django to "join two subquery"

PS: Keep in mind that I have reduced my table to three fields to simplify solving that particular problem, but I have ~100 column in my real table on which I'm making different calculation in the same queries.

Vivian answered 1/2 at 13:3 Comment(1)
Re "not that it should matter but my database is...": it's actually very important. I wanted to suggest django-cte that solves exactly your problem, but it supports only Postgres.Oden
A
0

You can use a combination of values and annotate to group by manufacturer then calculate the group's average with Avg.

The average_price is quite easy to calculate:

from django.db.models import Avg
from django.db.models.functions import Round

averages =
Car.objects.values("manufacturer").annotate(average_price=Round(Avg("price"), precision=2))

But to calculate the top five in each group, it gets a little complicated (I think). For that, you need a Subquery. So, the full code would be:


from django.db.models import Subquery, OuterRef, Avg, Q
from django.db.models.functions import Round

group_top_5 = Car.objects.filter(manufacturer=OuterRef("manufacturer")).order_by("-price")[:5].values("price")

query_filter = Q(price__in=group_top_5)
averages = (
Car.objects.values("manufacturer")
.annotate(
average_price=Round(Avg("price"), precision=2), 
average_top_5_price=Round(Avg("price", filter=query_filter), precision=2))

)

This should give you:

{'manufacturer': 'Citroen', 'average_price': 3.5, 'average_top_5_price': 4.0}
{'manufacturer': 'Ford', 'average_price': 10.86, 'average_top_5_price': 12.2}
{'manufacturer': 'GenMotor', 'average_price': 22.33, 'average_top_5_price': 22.33}

Actress answered 2/2 at 8:17 Comment(3)
Hello, thank you for your answer, this solution is more or less my first attempt, but unfortunately , the database doesn't support "This type of correlated subquery pattern", the problem is django generate a subquery in a "case when" which the database is not happy about. See my original question in pure SQL for the detailVivian
That's really unfortunate.Actress
I'm thinking I probably should write a bug report on that since django should not generate an invalid SQLVivian
A
0

The conditional aggregation query would have been the best but unfortunately I could not translate it to Django. Here is my attempt at conversion:

from carsdemo.models import Cars
from django.db.models import Avg, F, RowRange, Window
from django.db.models.functions import RowNumber

result = (
    Cars.objects.values("manufacturer")
    .annotate(
        average_price=Window(
            expression=Avg("price"),
            partition_by=F("manufacturer"),
        ),
        average_top_5_price=Window(
            expression=Avg("price"),
            partition_by=F("manufacturer"),
            order_by=F("price").desc(),
            frame=RowRange(start=0, end=4),
        ),
        rn=Window(
            expression=RowNumber(),
            partition_by=F("manufacturer"),
            order_by=F("price").desc(),
        ),
    )
    .filter(rn=1)
)

It produces the following query (on SQL Server):

SELECT *
FROM (
    SELECT
        carsdemo_cars.manufacturer AS Col1,
        AVG(carsdemo_cars.price) OVER (PARTITION BY carsdemo_cars.manufacturer ORDER BY (SELECT NULL)) AS average_price,
        AVG(carsdemo_cars.price) OVER (PARTITION BY carsdemo_cars.manufacturer ORDER BY carsdemo_cars.price DESC ROWS BETWEEN CURRENT ROW AND 4 FOLLOWING) AS average_top_5_price,
        ROW_NUMBER() OVER (PARTITION BY carsdemo_cars.manufacturer ORDER BY carsdemo_cars.price DESC) AS rn
    FROM carsdemo_cars
) qualify
WHERE rn = 1

And the result:

<QuerySet [
{'manufacturer': 'Citroen', 'average_price': 3.5, 'average_top_5_price': 4.0, 'rn': 1},
{'manufacturer': 'Ford', 'average_price': 10.857142857142858, 'average_top_5_price': 12.2, 'rn': 1},
{'manufacturer': 'GenMotor', 'average_price': 22.333333333333332, 'average_top_5_price': 22.333333333333332, 'rn': 1}
]>
Asci answered 6/2 at 14:19 Comment(5)
Hi, thank you for taking the time to answer, unfortunately my DBMS doesn't seem to support this type of queries. Like with all previous attempts, I get "NotSupportedError: This type of correlated subquery pattern is not supported yet".Vivian
Does the raw query (posted in answer) work in your RDBMS? You will have to replace TOP 1 with OFFSET 0 ROWS FETCH FIRST 1 ROW ONLY. If so, try upgrading your libraries.Asci
The raw query doesn't work, the error is coming directly from the database. But the raw queries in the question do workVivian
See revised answer, give it a try.Asci
Thank you again for the time invested, maybe my version of django is too old: I get django.db.utils.NotSupportedError: Window is disallowed in the filter clause.Vivian

© 2022 - 2024 — McMap. All rights reserved.