Using Subquery to annotate a Count
Asked Answered
N

3

12

Please help me I've been stuck on this for way too long :(

What I want to do:

I have these two models:

class Specialization(models.Model):
    name = models.CharField("name", max_length=64)
class Doctor(models.Model):
    name = models.CharField("name", max_length=128)
    # ...
    specialization = models.ForeignKey(Specialization)

I would like to annotate all specializations in a queryset with the number of doctors that have this specialization.

My Solution so Far:

I went through a loop and I made a simple: Doctor.objects.filter(specialization=spec).count() however this proved to be too slow and inefficient. The more I've read the more I realized that it would make sense to use a SubQuery here to filter the doctors for the OuterRef specialization. This is what I came up with:

doctors = Doctor.objects.all().filter(specialization=OuterRef("id")) \
    .values("specialization_id") \
    .order_by()
add_doctors_count = doctors.annotate(cnt=Count("specialization_id")).values("cnt")[:1]

spec_qs_with_counts = Specialization.objects.all().annotate(
    num_applicable_doctors=Subquery(add_doctors_count, output_field=IntegerField())
)

The output I get is just 1 for every speciality. The code just annotates every doctor object with its specialization_id and then annotates the count within that group, meaning it will be 1.

This doesn't make complete sense to me unfortunately. In my initial attempt I used an aggregate for the count, and while it works on its own it doesn't work as a SubQuery, I get this error:

This queryset contains a reference to an outer query and may only be used in a subquery.

I posted this question before and someone suggested doing Specialization.objects.annotate(count=Count("doctor"))

However this doesn't work because I need to count a specific queryset of Doctors.

I've followed these links

However, I'm not getting the same result:

If you have any questions that would make this clearer please tell me.

Nynorsk answered 26/8, 2018 at 15:26 Comment(1)
I think you misunderatnd the semantics of Count('doctor'): it counts the related doctors, so that means per specialization, you will get a (possibly) different number. It does not count all doctors, and annotates all specializations with that number.Mawkin
M
11

Counting all Doctors per Specialization

I think you make things overly complicated, probably because you think that Count('doctor') will count every doctor per specialization (regardless the specialization of that doctor). It does not, if you Count such related object, Django implicitly looks for related objects. In fact you can not Count('unrelated_model') at all, it is only through relations (reversed included) like a ForeignKey, ManyToManyField, etc. that you can query these, since otherwise these are not very sensical.

I would like to annotate all specializations in a queryset with the number of doctors that have this specialization.

You can do this with a simple:

#  Counting all doctors per specialization (so not all doctors in general)

from django.db.models import Count

Specialization.objects.annotate(
    num_doctors=Count('doctor')
)

Now every Specialization object in this queryset will have an extra attribute num_doctors that is an integer (the number of doctors with that specialization).

You can also filter on the Specializations in the same query (for example only obtain specializations that end on 'my'). As long as you do not filter on the related doctors set, the Count will work (see section below how to do this).

If you however filter on the related doctors, then the related counts will filter out these doctors. Furthermore if you filter on another related object, then this will result in an extra JOIN, that will act as a multiplier for the Counts. In that case it might be better to use num_doctors=Count('doctor', distinct=True) instead. You can always use the distinct=True (regardless if you do extra JOINs or not), but it will have a small performance impact.

The above works because Count('doctor') does not simply adds all doctors to the query, it makes a LEFT OUTER JOIN on the doctors table, and thus checks that the specialization_id of that Doctor is exactly the one we are looking for. So the query Django will construct looks like:

SELECT specialization.*
       COUNT(doctor.id) AS num_doctors
FROM specialization
LEFT OUTER JOIN doctor ON doctor.specialization_id = specialization.id
GROUP BY specialization.id

Doing the same with a subquery will functionally get the same results, but if the Django ORM and the database management system do not find a way to optimize this, this can result in an expensive query, since for every specialization, it then can result in an extra subquery in the database.

Counting specific Doctors per Specialization

Say however you want to count only doctors that have a name that starts with Joe, then you can add a filter on the related doctor, like:

#  counting all Doctors with as name Joe per specialization

from django.db.models import Count

Specialization.objects.filter(
    doctor__name__startswith='Joe'  # sample filter
).annotate(
    num_doctors=Count('doctor')
)
Mawkin answered 26/8, 2018 at 15:31 Comment(6)
Thanks for the reply, I appreciate it. I just tried this and initially thought this wasn't going to work, but it did! You're right I was overcomplicating this, I just thought the count for the doctors is going to count ALL the doctors for that Specialization, but the filter actually takes them out of the count. Thanks for the help.Nynorsk
The problem with annotate + Count is that (as I have discovered), Django generates a left outer join and then selects a COUNT(DISTINCT).. which can be VERY SLOW because so many intermediate rows are generated in the result setAllegedly
@little_birdie: yes, but the alternative of doing it one per element is slower, since then this yields a round trip per subquery. The idea of an annotate is that you do this if you need the count of all Specializations, in that case this is, querywise more interesting than counting one per Specialization, since this is the famous N+1 problem.Mawkin
I'm not suggesting multiple round trips. I'm suggesting that the query which django generates when you annotate(Count()) on a reverse join can be VERY slow, especially when you are doing multiple counts.. because it does an outer join to every table and then does a COUNT(DISTINCT).. this results in a huge number of rows for the server to sort through. In my application, it took 5 seconds for postgres to execute django's query whereas my hand coded query which does (SELECT COUNT(*) .. ) AS acount executes in 34 milliseconds.Allegedly
@little_birdie: but here we do not COUNT(DISTINCT ...) note that there is no Count(..., distinct=True) in the ORM query, you can obtain the query with print(str(my_django_query.query)), and I've tested in on both a MySQL and PostgreSQL, and both result, as expected in a COUNT(doctor_id) (as listed in the answer). Especially if there would be multiple aggregates involved, a JOIN will eventually outperform a subquery, since it performs multiple scans (source: itprotoday.com/sql-server/…)Mawkin
The DISTINCT is key here, since for a reasonable modern database system, the query plan will "collapse" the count to an accumulator, or it can even make use of the MUL index.Mawkin
H
34

Problem

The problem is that Django adds GROUP BY as soon as it sees using an aggregate function.

Solution

So you can just create your own aggregate function but so that Django thinks it is not aggregate. Just like this:

doctors = Doctor.objects.filter(
    specialization=OuterRef("id")
).order_by().annotate(
    count=Func(F('id'), function='Count')
).values('count')

spec_qs_with_counts = Specialization.objects.annotate(
    num_applicable_doctors=Subquery(doctors)
)

The more details on this method you can see in this answer: https://mcmap.net/q/159876/-django-1-11-annotating-a-subquery-aggregate

Useful information can also be found in the docs about using aggregates within a subquery expression and func expressions.

Heckle answered 2/9, 2021 at 13:15 Comment(3)
is there any difference in performance between both methods? i prefer this way as it avoids unwanted group byChivalric
It's more related to unperidictable query. Without Func expression , It makes different result. Func(F('id'),function='Count') Count('id') Those are totally diffrent. I faced same problem.Transpontine
You may want to make use of output_field on Func. It seems like Django infers the output field from the column being counted (id in this case). id isn't always an integer. In my case, it was a UUID field.Gusella
M
11

Counting all Doctors per Specialization

I think you make things overly complicated, probably because you think that Count('doctor') will count every doctor per specialization (regardless the specialization of that doctor). It does not, if you Count such related object, Django implicitly looks for related objects. In fact you can not Count('unrelated_model') at all, it is only through relations (reversed included) like a ForeignKey, ManyToManyField, etc. that you can query these, since otherwise these are not very sensical.

I would like to annotate all specializations in a queryset with the number of doctors that have this specialization.

You can do this with a simple:

#  Counting all doctors per specialization (so not all doctors in general)

from django.db.models import Count

Specialization.objects.annotate(
    num_doctors=Count('doctor')
)

Now every Specialization object in this queryset will have an extra attribute num_doctors that is an integer (the number of doctors with that specialization).

You can also filter on the Specializations in the same query (for example only obtain specializations that end on 'my'). As long as you do not filter on the related doctors set, the Count will work (see section below how to do this).

If you however filter on the related doctors, then the related counts will filter out these doctors. Furthermore if you filter on another related object, then this will result in an extra JOIN, that will act as a multiplier for the Counts. In that case it might be better to use num_doctors=Count('doctor', distinct=True) instead. You can always use the distinct=True (regardless if you do extra JOINs or not), but it will have a small performance impact.

The above works because Count('doctor') does not simply adds all doctors to the query, it makes a LEFT OUTER JOIN on the doctors table, and thus checks that the specialization_id of that Doctor is exactly the one we are looking for. So the query Django will construct looks like:

SELECT specialization.*
       COUNT(doctor.id) AS num_doctors
FROM specialization
LEFT OUTER JOIN doctor ON doctor.specialization_id = specialization.id
GROUP BY specialization.id

Doing the same with a subquery will functionally get the same results, but if the Django ORM and the database management system do not find a way to optimize this, this can result in an expensive query, since for every specialization, it then can result in an extra subquery in the database.

Counting specific Doctors per Specialization

Say however you want to count only doctors that have a name that starts with Joe, then you can add a filter on the related doctor, like:

#  counting all Doctors with as name Joe per specialization

from django.db.models import Count

Specialization.objects.filter(
    doctor__name__startswith='Joe'  # sample filter
).annotate(
    num_doctors=Count('doctor')
)
Mawkin answered 26/8, 2018 at 15:31 Comment(6)
Thanks for the reply, I appreciate it. I just tried this and initially thought this wasn't going to work, but it did! You're right I was overcomplicating this, I just thought the count for the doctors is going to count ALL the doctors for that Specialization, but the filter actually takes them out of the count. Thanks for the help.Nynorsk
The problem with annotate + Count is that (as I have discovered), Django generates a left outer join and then selects a COUNT(DISTINCT).. which can be VERY SLOW because so many intermediate rows are generated in the result setAllegedly
@little_birdie: yes, but the alternative of doing it one per element is slower, since then this yields a round trip per subquery. The idea of an annotate is that you do this if you need the count of all Specializations, in that case this is, querywise more interesting than counting one per Specialization, since this is the famous N+1 problem.Mawkin
I'm not suggesting multiple round trips. I'm suggesting that the query which django generates when you annotate(Count()) on a reverse join can be VERY slow, especially when you are doing multiple counts.. because it does an outer join to every table and then does a COUNT(DISTINCT).. this results in a huge number of rows for the server to sort through. In my application, it took 5 seconds for postgres to execute django's query whereas my hand coded query which does (SELECT COUNT(*) .. ) AS acount executes in 34 milliseconds.Allegedly
@little_birdie: but here we do not COUNT(DISTINCT ...) note that there is no Count(..., distinct=True) in the ORM query, you can obtain the query with print(str(my_django_query.query)), and I've tested in on both a MySQL and PostgreSQL, and both result, as expected in a COUNT(doctor_id) (as listed in the answer). Especially if there would be multiple aggregates involved, a JOIN will eventually outperform a subquery, since it performs multiple scans (source: itprotoday.com/sql-server/…)Mawkin
The DISTINCT is key here, since for a reasonable modern database system, the query plan will "collapse" the count to an accumulator, or it can even make use of the MUL index.Mawkin
T
0

Here is the efficient way

specialization_with_doctor_counts = Specialization.objects.annotate(doctor_count=Count('doctor')).order_by('-doctor_count')
Tinct answered 6/5 at 7:41 Comment(1)
Your answer could be improved with additional supporting information. Please edit to add further details, such as citations or documentation, so that others can confirm that your answer is correct. You can find more information on how to write good answers in the help center.Disoperation

© 2022 - 2024 — McMap. All rights reserved.