django select_related - when to use it
Asked Answered
A

5

51

I'm trying to optimize my ORM queries in django. I use connection.queries to view the queries that django generate for me.

Assuming I have these models:

class Book(models.Model):
    name   = models.CharField(max_length=50)
    author = models.ForeignKey(Author)

class Author(models.Model):
    name   = models.CharField(max_length=50)

Let's say, when I generate a specific webpage, I want to display all books, with their author name next to each of them. Also, I display seperately all the authors.

So should I use

Book.objects.all().select_related("author")

Which will result in a JOIN query. Even if I do a line before:

Author.objects.all()

Obviously in template I will write something like {{book.author.name}}.
So the question is, when I access a foreign key value (author), if django already has that object from another query, will that still result in additional query (for each book)? If no, so in that case, does using select_related actually creates performance overhead?

Apoplectic answered 20/10, 2015 at 7:29 Comment(0)
S
40

You are actually asking two different questions:

1. does using select_related actually creates performance overhead?

You should see documentation about Django Query Cache:

Understand QuerySet evaluation

To avoid performance problems, it is important to understand:

  • that QuerySets are lazy.

  • when they are evaluated.

  • how the data is held in memory.

So in summary, Django caches in memory results evaluated within the same QuerySet object, that is, if you do something like that:

books = Book.objects.all().select_related("author")
for book in books:
    print(book.author.name)  # Evaluates the query set, caches in memory results
first_book = books[1]  # Does not hit db
print(first_book.author.name)  # Does not hit db  

Will only hit db once as you prefetched Authors in select_related, all this stuff will result in a single database query with INNER JOIN.

BUT this won't do any cache between querysets, nor even with the same query:

books = Book.objects.all().select_related("author")
books2 = Book.objects.all().select_related("author")
first_book = books[1]  # Does hit db
first_book = books2[1]  # Does hit db

This is actually pointed out in docs:

We will assume you have done the obvious things above. The rest of this document focuses on how to use Django in such a way that you are not doing unnecessary work. This document also does not address other optimization techniques that apply to all expensive operations, such as general purpose caching.

2. if django already has that object from another query, will that still result in additional query (for each book)?

You are actually meaning if Django does ORM queries caching, which is a very different matter. ORM Queries caching, that is, if you do a query before and then you do the same query later, if database hasn't changed, the result is coming from a cache and not from an expensive database lookup.

The answer is not Django, not officially supported, but yes unofficially, yes through 3rd-party apps. The most relevant third-party apps that enables this type of caching are:

  1. Johnny-Cache (older, not supporting django>1.6)
  2. Django-Cachalot (newer, supports 1.6, 1.7, and still in dev 1.8)
  3. Django-Cacheops (newer, supports Python 2.7 or 3.3+, Django 1.8+ and Redis 2.6+ (4.0+ recommended))

Take a look a those if you look for query caching and remember, first profile, find bottlenecks, and if they are causing a problem then optimize.

The real problem is that programmers have spent far too much time worrying about efficiency in the wrong places and at the wrong times; premature optimization is the root of all evil (or at least most of it) in programming. Donald Knuth.

Sasha answered 20/10, 2015 at 8:3 Comment(3)
I don't think I meant to entire queries caching. I meant that since I have all the authors in the previous query (as I did Author.objects.all() ), so doing later book.author.name should not result in a query, since django can somehow fetch it from the authors queryset. But I realize it does not work.Apoplectic
Well, you confirm you "expect" query caching. Django caches at QuerySet level, if you have two objects, each has a different cache. If you say that you perform a Author.objects.all(), without the select_related, it will result in a query, because the FK is lazy. If you add select_related it won't hit db again using same QuerySet object. If you do another Author.objects.all() in another point of the code, previous results are not cached, and will be evaluated again in then new QuerySet. Maybe the question is not clear enough when you say you've done a Author.objects.all() before.Sasha
I think select_related does left outer join instead of inner join.Jobie
L
27

Django doesn't know about other queries! Author.objects.all() and Book.objects.all() are totally different querysets. So if have both in your view and pass them to template context but in your template you do something like:

{% for book in books %}
  {{ book.author.name }}
{% endfor %}

and have N books this will result to N extra database queries (beyond the queries to get all books and authors) !

If instead you had done Book.objects.all().select_related("author") no extra queries will be done in the above template snippet.

Now, select_related() of course adds some overhead to the queries. What happens is that when you do a Book.objects.all() django will return the result of SELECT * FROM BOOKS. If instead you do a Book.objects.all().select_related("author") django will return the result of SELECT * FROM BOOKS B LEFT JOIN AUTHORS A ON B.AUTHOR_ID = A.ID. So for each book it will return both the columns of the book and its corresponding author. However, this overhead is really much smaller when compared to the overhead of hitting the database N times (as explained before).

So, even though select_related creates a small performance overhead (each query returns more fields from the database) it will actually be beneficial to use it except when you are totally sure that you'll need only the columns of the specific model you are querying.

Finally, a great way to really see how many (and which exactly) queries are actuall exectuted in your database is to use django-debug-tooblar (https://github.com/django-debug-toolbar/django-debug-toolbar).

Larrainelarrie answered 20/10, 2015 at 7:52 Comment(4)
Thanks. If my Author object is somewhat bigger, means it has some more fields, and I only need the author's name to be selected with the book. Is it will be legal, and better in performance to write: Book.objects.all().select_related("author_name").Apoplectic
Hmm I think yes it will legal and better since you'll get only the needed field. However please try it like this Book.objects.all().select_related('author__name'). Also please check my updateLarrainelarrie
I dont understand why you do .all()Create
It does raise another question, is there any difference where do I use .all() ? Means, is there any difference Between: Book.objects.all().select_related("author"), or Book.objects.select_related("author").all(). Or actually all() should be avoided at all? Since I read that all() is a callable hence always hit the DB.Apoplectic
C
7
Book.objects.select_related("author")

is good enough. No need for Author.objects.all()

{{ book.author.name }}

won't hit the database, because book.author has been prepopulated already.

Create answered 20/10, 2015 at 7:48 Comment(3)
I know that. But I wonder if that JOIN query would be a waste, considering the fact I also query all the authors. I DO need Author.objects.all() anyway, as I said, it is printed separately in the same page.Apoplectic
@Apoplectic if you want to show the author of the book, then you need select_related() otherwise you have more db hits.Create
If you use Author.objects.all(), it will also hit the db, thus increasing execution time.Cycloplegia
G
7

Select_related

select_related is an optional performance booster by which further access to the property of foreign_keys in a Queryset, won't hit the database.

Design philosophies

This is also why the select_related() QuerySet method exists. It’s an optional performance booster for the common case of selecting “every related object.”

Django official doc

Returns a QuerySet that will “follow” foreign-key relationships, selecting additional related-object data when it executes its query. This is a performance booster which results in a single more complex query but means later use of foreign-key relationships won’t require database queries.

As pointed in the definition, using select_related only is allowed in foreign_key relationships. Ignoring this rule will face you with below exception:

In [21]: print(Book.objects.select_related('name').all().query)

FieldError: Non-relational field given in select_related: 'name'. Choices are: author

Let's dive into it with an example:

Here are my models.py. (It's the same as Question asked)

from django.db import models


class Author(models.Model):
    name = models.CharField(max_length=50)

    def __str__(self):
        return self.name

    __repr__ = __str__


class Book(models.Model):
    name = models.CharField(max_length=50)
    author = models.ForeignKey(Author, related_name='books', on_delete=models.DO_NOTHING)

    def __str__(self):
        return self.name

    __repr__ = __str__
  • Fetching all books and their author's using the relect_related booster:
In [25]: print(Book.objects.select_related('author').all().explain(verbose=True, analyze=True))
Hash Join  (cost=328.50..548.39 rows=11000 width=54) (actual time=3.124..8.013 rows=11000 loops=1)
  Output: library_book.id, library_book.name, library_book.author_id, library_author.id, library_author.name
  Inner Unique: true
  Hash Cond: (library_book.author_id = library_author.id)
  ->  Seq Scan on public.library_book  (cost=0.00..191.00 rows=11000 width=29) (actual time=0.008..1.190 rows=11000 loops=1)
        Output: library_book.id, library_book.name, library_book.author_id
  ->  Hash  (cost=191.00..191.00 rows=11000 width=25) (actual time=3.086..3.086 rows=11000 loops=1)
        Output: library_author.id, library_author.name
        Buckets: 16384  Batches: 1  Memory Usage: 741kB
        ->  Seq Scan on public.library_author  (cost=0.00..191.00 rows=11000 width=25) (actual time=0.007..1.239 rows=11000 loops=1)
              Output: library_author.id, library_author.name
Planning Time: 0.234 ms
Execution Time: 8.562 ms

In [26]: print(Book.objects.select_related('author').all().query)
SELECT "library_book"."id", "library_book"."name", "library_book"."author_id", "library_author"."id", "library_author"."name" FROM "library_book" INNER JOIN "library_author" ON ("library_book"."author_id" = "library_author"."id")

As you can see, using select_related cause an INNER JOIN on the provided foreign keys(Here was author).

The execution time which the time of:

  • Running the query using the quickest plan which has been chosen by the planner
  • Returning the results

Is 8.562 ms

On the other hand:

  • Fetching all books and their author's without using the relect_related booster:
In [31]: print(Book.objects.all().explain(verbose=True, analyze=True))
Seq Scan on public.library_book  (cost=0.00..191.00 rows=11000 width=29) (actual time=0.017..1.349 rows=11000 loops=1)
  Output: id, name, author_id
Planning Time: 1.135 ms
Execution Time: 2.536 ms

In [32]: print(Book.objects.all().query)
SELECT "library_book"."id", "library_book"."name", "library_book"."author_id" FROM "library_book

As you can see, It's just a simple SELECT query on book models that only contains author_id. The execution time, in this case, is 2.536 ms.

As mentioned in the Django doc:

Further access to the foreign-key properties will cause another hit on the database: (CUZ we don't have them already)

In [33]: books = Book.objects.all()

In [34]: for book in books:
    ...:     print(book.author) # Hit the database

See Also Database access optimization and explain() in QuerySet API reference

Django Database Caching:

Django comes with a robust cache system that lets you save dynamic pages so they don’t have to be calculated for each request. For convenience, Django offers different levels of cache granularity: You can cache the output of specific views, you can cache only the pieces that are difficult to produce, or you can cache your entire site.

Django also works well with “downstream” caches, such as Squid and browser-based caches. These are the types of caches that you don’t directly control but to which you can provide hints (via HTTP headers) about which parts of your site should be cached, and how.

You should read those docs to find out which of them suits you most.


PS1: for gaining further information about the planner and how it's work see Why Planing time and Execution time are so different Postgres? and Using EXPLAIN)


Gentilesse answered 11/9, 2020 at 18:9 Comment(0)
F
0

In general, select_related should be used when the related data is small and will be accessed frequently. Prefetch_related should be used when the related data is large or will only be accessed occasionally.

Fernferna answered 22/4, 2024 at 10:22 Comment(0)

© 2022 - 2025 — McMap. All rights reserved.