Can I have SQLAlchemy do subquery eager loading without repeating the full original query?
Asked Answered
A

4

12

Suppose we have original generated query like that:

SELECT company.x AS company_x, ...
FROM company
LEFT OUTER JOIN acc ON acc.id = company.acc
LEFT OUTER JOIN usercomp_links ON company.id = usercomp_links.pid
LEFT OUTER JOIN usergro_links ON acc.id = usergro_links.pid
WHERE usergro_links.eid = %s OR usercomp_links.eid = %s

And if we add .options(subqueryload(Company.childs)) to this, we will get:

SELECT company.x AS company_x, ..., anon_1.company_id AS anon_1_company_id
FROM (
    SELECT company.id AS company_id
    FROM company
    LEFT OUTER JOIN acc ON acc.id = company.acc
    LEFT OUTER JOIN usercomp_links ON company.id = usercomp_links.pid
    LEFT OUTER JOIN usergro_links ON acc.id = usergro_links.pid
    WHERE usergro_links.eid = %s OR usercomp_links.eid = %s) AS anon_1
INNER JOIN acel_links AS acel_links_1 ON anon_1.company_id = acel_links_1.eid
INNER JOIN company ON company.id = acel_links_1.pid ORDER BY anon_1.company_id

And this is sloooow. If I will get company ids from first query, and load all child companies by hands, it will be blazingly fast in compare to what we get in this case.

I have read documentation, looked at the code, but dont see if I can tell sqlalchemy just get ids from results of first query and load childs in separate, comparatively simple query. I dont rely on this sample - I had more more harder situations, when sqlalchemy just cant load constructed query. And why to do all this work from first query one more time?

So anyone know how to eager load without auto constructed "join from join in join" style?

Annulation answered 2/11, 2014 at 20:33 Comment(12)
As an example, Eloquent from Laravel (a PHP framework) does eager loading by merely loading the related records directly by ID. E.g. fetching books: select * from books, fetching authors of books: select * from authors where id in (1, 2, 3, 4, 5, ...) (the books query already gave us the author IDs!)Hulsey
@Agop: Which RDBMS are you using?Arleyne
@Agop: Do you have similar query as the OP? Specifically, do you have an OR statement there? Or are you really looking for another implementation of eager loading technique?Arleyne
@van: Postgres, although the engine shouldn't matter. SQLAlchemy is embedding the original query as a subquery for the eager load. Regardless of the OR statement, this is extremely inefficient, especially when the original query is somewhat slow on its own.Hulsey
@van: Ultimately, my goal is to avoid the subquery altogether. Once the original query runs, we should have the IDs of the related objects. The second query (to eagerly load those related objects) should be a very simple form of WHERE id IN (/* relationship IDs from original query */). This is how other ORMs do it (see my Laravel Eloquent example).Hulsey
@Hulsey maybe I misunderstood a problem, but why joinedload is not a solution?Mauramauralia
@Agop: Ideally, the solution might be to actually implement this type of subquery using IN (...). It sounds like it could be implemented, and the fact that it is not indicates to me that the demand for such loading strategy is really small and/or there are some drawbacks of such implementation. In any event, instead of blaming sqlalchemy for producing slow query, I would suggest you to take this query and put it through query optimiser. I suspect that with 1-2 database indices the query will become really fast. Would be great if you did that and reported back. Thank you.Arleyne
@van: Right, you could optimize the query. The point is, it's still duplicated. SQLAlchemy runs it once to give you results, then runs it again as a subquery to get the related records. The question is: Why run it again? If it was already run, and we got the relationship IDs, why run the same query again?Hulsey
@Agop: I do not think your statement is actually correct. The same query is not ran twice. It is a variation of the original query which used as a sub-query in the second query. And in the case of the OP (and/or your case) the performance of this second query is quite bad, whereas the performance of the original query is good (enough).Arleyne
... Therefore, I read your question as: *How do I make it performing well?". In this context, the most pragmatic approach would be to try DB indices. / You have, however, immediately decided that using another strategy (using IN (...) filter) would be definitely better (faster). It looks logical, and I tend to agree that indeed this would be the case. However, we do not know for sure. Also, I do remember that IN.. (as well as OR) was a bad performance contributor in SQL queries. / If you really need different implementation, you can implement it. I am sure, Mike will help you.Arleyne
@van: I think you're completely missing the point. Yes, the query that runs as a subquery is a (slight) variation of the original query. The point is, it still has all of the same conditions as the original query (all joins, all where clauses, etc.). This is unnecessary - the original query already gave us all of the relationship IDs, so why repeat it (or a slight variation of it)?Hulsey
@Agop: i think i do get it - you are right, in the world without caching this implementation would be superior. However, I think that given proper set of indices most RDBMS will be very-very efficient and might even re-use the results of the sub-query from in-memory, and will not even execute the same query again. Therefore, i am not sure that using caching on python level will add to already existing caching on the DB level. / And again, sqlalchemy is an open-source project - would be great if you could contribute your implementation of this loading strategy. Happy to help where i canArleyne
A
3

http://docs.sqlalchemy.org/en/latest/orm/loading_relationships.html#sqlalchemy.orm.selectinload

It was added to sqlalchemy, so now you can just use selectinload strategy.

Annulation answered 3/5, 2018 at 16:26 Comment(0)
M
17

Update: the "select in" strategy is now implemented in SQLAlchemy (since v 1.2): see Select IN loading in the documentation.

TLDR:

I think the joinedload strategy should be used where possible, as it is more efficient than other strategies, including the suggested in the question strategy to load related data using the "IN" statement.

The "IN" strategy can be easily enough implemented "outside" of the SQLAlchemy (see the code below) and probably it should not be complex to implement it as a new loading strategy (as logically it is similar to existing subqueryload strategy).

Full version:

I started with a simple experiment to see the queries produced by different strategies

The full source code of the experiment is on Github.

My models look this way:

class Author(ModelBase):
    __tablename__ = 'authors'
    id = Column(Integer, primary_key=True, nullable=False)
    name = Column(String(255))


class Book(ModelBase):
    __tablename__ = 'books'
    id = Column(Integer, primary_key=True)
    name = Column(String)
    author_id = Column(Integer, ForeignKey('authors.id'))
    author = relationship(
        'Author', backref=backref('books'))

Now, the tests, first is lazy loading:

books = session.query(Book).all()
print books[0].author.name
session.commit()

Output (cleaned up):

-------------Lazy--------------
sqlalchemy.engine.base.Engine:
SELECT
  books.id AS books_id, books.name AS books_name, books.author_id AS books_author_id
FROM books

SELECT
  authors.id AS authors_id, authors.name AS authors_name
FROM authors
WHERE authors.id = ?
INFO:sqlalchemy.engine.base.Engine:(1,)
author1

As expected, lazy loading runs one query to fetch books and one query each time we access an author.

Subquery loading:

books = session.query(Book).options(subqueryload(Book.author)).all()
print books[0].author.name
session.commit()

-------------Subquery----------
SELECT
  books.id AS books_id, books.name AS books_name, books.author_id AS books_author_id
FROM books

SELECT
  authors.id AS authors_id, authors.name AS authors_name,
  anon_1.books_author_id AS anon_1_books_author_id
FROM (
  SELECT DISTINCT books.author_id AS books_author_id
  FROM books) AS anon_1
JOIN authors
  ON authors.id = anon_1.books_author_id
ORDER BY anon_1.books_author_id
author1

For the subquery, we have two queries, first fetches books and another fetches authors using the subquery.

Joined loading:

books = session.query(Book).options(joinedload(Book.author)).all()
print books[0].author.name
session.commit()

-------------Joined------------
SELECT
  books.id AS books_id, books.name AS books_name,
  books.author_id AS books_author_id,
  authors_1.id AS authors_1_id, authors_1.name AS authors_1_name
FROM books
LEFT OUTER JOIN authors AS authors_1 ON authors_1.id = books.author_id
author1

The joined strategy runs just one query to fetch both books and authors.

Immediate loading:

books = session.query(Book).options(immediateload(Book.author)).all()
print books[0].author.name
session.commit()

-------------Immediate---------
SELECT
   books.id AS books_id, books.name AS books_name, books.author_id AS books_author_id
FROM books

SELECT
  authors.id AS authors_id, authors.name AS authors_name
FROM authors
WHERE authors.id = ?
INFO:sqlalchemy.engine.base.Engine:(1,)

SELECT authors.id AS authors_id, authors.name AS authors_name
FROM authors
WHERE authors.id = ?
INFO:sqlalchemy.engine.base.Engine:(2,)

author1

And the immediate strategy loads books with the first query and then, when we try to access the relation, fetches all the related data with separate query for each related record.

It looks that "joinedload()" should be most efficient in most cases (amd more efficient than "IN" strategy) - we just get all the data with single query.

Now, lets try to implement the IN strategy outside of SQL alchemy:

print '-------------IN----------------'
books = session.query(Book).all()
ids = set()
for b in books:
    ids.add(b.author_id)
authors = session.query(Author).filter(Author.id.in_(ids)).all()
print books[0].author.name
print books[1].author.name
print books[2].author.name
print books[3].author.name

Output:

-------------IN----------------
SELECT
  books.id AS books_id, books.name AS books_name, books.author_id AS books_author_id
FROM books

SELECT authors.id AS authors_id, authors.name AS authors_name
FROM authors
WHERE authors.id IN (?, ?)
INFO:sqlalchemy.engine.base.Engine:(1, 2)

author1
author1
author2
author2

As we can see, it runs two queries and then we can access all the authors.

Note that we don't join authors to books explicitly, but it still works when we try to access authors through the books, since SQLAlchemy finds author records in the internal identity map and doesn't run additional DB queries.

The "IN" strategy code similar to above can be generalized into the function which can be used with any model / relation. And probably, the "IN" strategy should be relatively easy to implement as a new SQLAlchemy strategy, it is similar to the existing subqueryloading - it also should run the second query to get the related data.

Maremma answered 22/8, 2016 at 7:43 Comment(6)
Great answer. I wonder if this sqlalchemy way of automatically linking related objects works in all cases without going to the database though.Arleyne
@Arleyne yes, until you close the session, SQLAlchemy maintains the "identity map" with all the objects and it fetches objects from there, skipping the database, see docs.sqlalchemy.org/en/latest/orm/session_basics.htmlMaremma
Awesome, exactly the type of answer that I was hoping someone would write. I previously came to the same conclusions myself, but it really helps to have a comprehensive answer like this for anyone that may run into the same question in the future (especially people coming from other languages and frameworks that may find some of the SQLAlchemy eager loading options confusing). Long story short: just use the joinedload strategy, and if you really want to use a second query with IN instead, write a simple helper function to do it.Hulsey
Brilliant, but you can get all this secret knowledge from docs and this is not an answer. If you can read question carefully, you can see that current strategies not always is an answer.Annulation
And about can easily implement — show me, because I dont think this is simple for anyone beside Mike.Annulation
@MihailKrivushin you wanted this "just get ids from results of first query and load childs in separate, comparatively simple query", if you read my answer, I show how to do that, check from "Now, lets try to implement the IN strategy outside of SQL alchemy" - it does exactly what you've asked. As for the new strategy - for sure, you don't need to be Mike for that, you just need to be a programmer. Based on the idea in my answer and using the existing "subquery" strategy implementation, this should be relatively easy (still will require some effort to learn the SQLAlchemy internals).Maremma
A
3

http://docs.sqlalchemy.org/en/latest/orm/loading_relationships.html#sqlalchemy.orm.selectinload

It was added to sqlalchemy, so now you can just use selectinload strategy.

Annulation answered 3/5, 2018 at 16:26 Comment(0)
V
0

You can choose either to work with the abstract ORM layer, in which case, you model the childern attribute as relation with ORM relationship, something like:

from sqlalchemy.orm import relationship

children = relationship("<name of the acl_links class>", lazy="joined")

Using the lazy="joined" result in eager loading, as requested (this is equivalent to joinedload suggested already by @vsminkov) from the documentation:

The default loader strategy for any relationship() is configured by the lazy keyword argument ... Below we set it as joined so that the children relationship is eager loaded using a JOIN

There are quite a few tweaks you can apply when defining the relationship, so check out the documentation to get the best out of it.

Or you can choose to work with the Query API and compose a query as you choose, for example, do a simple second where-in query, e.g.:

list_of_ids_previously_loaded_companies = <information from your previous query>

the_session.query(<name of the acl_links class>).filter(<name of the acl_links class>.eid.in_(list_of_ids_previously_loaded_companies)

You go even lower and use expression language, something like:

q = select([acl_links]).where(acl_links.c.eid.in_(list_of_ids_previously_loaded_companies))
the_session.execute(q).fetchall()

As final resort, you can do completely raw sql:

from sqlalchemy import text

children_results = a_db_connection.execute(text(<SQL STATEMENT STRING>).fetchall()

Choose which ever works best for your needs. Note it is still your responsibility to model correctly your schema and place proper index and foreign keys to optimize performance.

Vigor answered 19/8, 2016 at 10:15 Comment(2)
This doesn't answer the question. The question is asking how to avoid repeating the complex original query as a subquery when doing eager loading.Hulsey
all the above options avoid subquery. it is kind of hard to answer how not to do something. you wrote that a second where-in query is what youre looking for (you wrote in the comments that and also keep saying there is an "original query") and that is also what is being done on the link you shared. so please help us help you. can you please show what query you would have done on plain sql?Vigor
A
0

I made a post to the SQLAlchemy mailing list about this: https://groups.google.com/d/msg/sqlalchemy/8-kHuliJpr8/PHUZLLtMEQAJ

The "in" loading that Boris Serebrov mentioned only seems to work one way by default. It will still run queries (if you’re not doing eager loading) if you access relations from the “one” side of a one to many relationship.

I ended up with this solution: https://gist.github.com/pawl/df5ba8923d9929dd1f4fc4e683eced40

Altamira answered 17/3, 2017 at 3:25 Comment(0)

© 2022 - 2024 — McMap. All rights reserved.