Limit child collections in initial query sqlalchemy
Asked Answered
A

4

11

I am building an api which can return children of resources if the user requests it. For example, user has messages. I want the query to be able to limit the number of message objects that are returned.

I found a useful tip aboutl imiting the number of objects in child collections here. Basically, it indicates the following flow:

class User(...):
    # ...
    messages = relationship('Messages', order_by='desc(Messages.date)', lazy='dynamic')

user = User.query.one()
users.messages.limit(10)

My use case involves returning sometimes large numbers of users.

If I were to follow the advice in that link and used .limit() then I would need to iterate over the entire collection of users calling .limit() on each one. This is much less efficient then, say, using LIMIT in the original sql expression which created the collection.

My question is whether it is possible using declarative to efficiently(N+0) load a large collection of objects while limiting the number of children in their child collections using sqlalchemy?

UPDATE

To be clear, the below is what I am trying to avoid.

users = User.query.all()
messages = {}
for user in users:
    messages[user.id] = user.messages.limit(10).all()

I want to do something more like:

users = User.query.option(User.messages.limit(10)).all()
Axillary answered 1/5, 2017 at 22:11 Comment(0)
A
6

This answer comes from Mike Bayer on the sqlalchemy google group. I'm posting it here to help folks: TLDR: I used version 1 of Mike's answer to solve my problem because, in this case, I do not have foreign keys involved in this relationship and so cannot make use of LATERAL. Version 1 worked great, but be sure to note the effect of offset. It threw me off during testing for a while because I didn't notice it was set to something other than 0.

Code Block for version 1:

subq = s.query(Messages.date).\
    filter(Messages.user_id == User.id).\
    order_by(Messages.date.desc()).\
    limit(1).offset(10).correlate(User).as_scalar()

q = s.query(User).join(
    Messages,
    and_(User.id == Messages.user_id, Messages.date > subq)
).options(contains_eager(User.messages))

Mike's Answer so you should ignore whether or not it uses "declarative", which has nothing to do with querying, and in fact at first ignore Query too, because first and foremost this is a SQL problem. You want one SQL statement that does this. What query in SQL would load lots of rows from the primary table, joined to the first ten rows of the secondary table for each primary?

LIMIT is tricky because it's not actually part of the usual "relational algebra" calculation. It's outside of that because it's an artificial limit on rows. For example, my first thought on how to do this was wrong:

    select * from users left outer join (select * from messages limit 10) as anon_1 on users.id = anon_1.user_id

This is wrong because it only gets the first ten messages in the aggregate, disregarding user. We want to get the first ten messages for each user, which means we need to do this "select from messages limit 10" individually for each user. That is, we need to correlate somehow. A correlated subquery though is not usually allowed as a FROM element, and is only allowed as a SQL expression, it can only return a single column and a single row; we can't normally JOIN to a correlated subquery in plain vanilla SQL. We can however, correlate inside the ON clause of the JOIN to make this possible in vanilla SQL.

But first, if we are on a modern Postgresql version, we can break that usual rule of correlation and use a keyword called LATERAL, which allows correlation in a FROM clause. LATERAL is only supported by modern Postgresql versions, and it makes this easy:

    select * from users left outer join lateral
    (select * from message where message.user_id = users.id order by messages.date desc limit 10) as anon1 on users.id = anon_1.user_id

we support the LATERAL keyword. The query above looks like this:

subq = s.query(Messages).\
    filter(Messages.user_id == User.id).\
    order_by(Messages.date.desc()).limit(10).subquery().lateral()

q = s.query(User).outerjoin(subq).\
     options(contains_eager(User.messages, alias=subq))

Note that above, in order to SELECT both users and messages and produce them into the User.messages collection, the "contains_eager()" option must be used and for that the "dynamic" has to go away. This is not the only option, you can for example build a second relationship for User.messages that doesn't have the "dynamic" or you can just load from query(User, Message) separately and organize the result tuples as needed.

if you aren't using Postgresql, or a version of Postgresql that doesn't support LATERAL, the correlation has to be worked into the ON clause of the join instead. The SQL looks like:

select * from users left outer join messages on
users.id = messages.user_id and messages.date > (select date from messages where messages.user_id = users.id order by date desc limit 1 offset 10)

Here, in order to jam the LIMIT in there, we are actually stepping through the first 10 rows with OFFSET and then doing LIMIT 1 to get the date that represents the lower bound date we want for each user. Then we have to join while comparing on that date, which can be expensive if this column isn't indexed and also can be inaccurate if there are duplicate dates.

This query looks like:

subq = s.query(Messages.date).\
    filter(Messages.user_id == User.id).\
    order_by(Messages.date.desc()).\
    limit(1).offset(10).correlate(User).as_scalar()

q = s.query(User).join(
    Messages,
    and_(User.id == Messages.user_id, Messages.date >= subq)
).options(contains_eager(User.messages))

These kinds of queries are the kind that I don't trust without a good test, so POC below includes both versions including a sanity check.

from sqlalchemy import *
from sqlalchemy.orm import *
from sqlalchemy.ext.declarative import declarative_base
import datetime

Base = declarative_base()


class User(Base):
    __tablename__ = 'user'
    id = Column(Integer, primary_key=True)
    messages = relationship(
        'Messages', order_by='desc(Messages.date)')

class Messages(Base):
    __tablename__ = 'message'
    id = Column(Integer, primary_key=True)
    user_id = Column(ForeignKey('user.id'))
    date = Column(Date)

e = create_engine("postgresql://scott:tiger@localhost/test", echo=True)
Base.metadata.drop_all(e)
Base.metadata.create_all(e)

s = Session(e)

s.add_all([
    User(id=i, messages=[
        Messages(id=(i * 20) + j, date=datetime.date(2017, 3, j))
        for j in range(1, 20)
    ]) for i in range(1, 51)
])

s.commit()

top_ten_dates = set(datetime.date(2017, 3, j) for j in range(10, 20))


def run_test(q):
    all_u = q.all()
    assert len(all_u) == 50
    for u in all_u:

        messages = u.messages
        assert len(messages) == 10

        for m in messages:
            assert m.user_id == u.id

        received = set(m.date for m in messages)

        assert received == top_ten_dates

# version 1.   no LATERAL

s.close()

subq = s.query(Messages.date).\
    filter(Messages.user_id == User.id).\
    order_by(Messages.date.desc()).\
    limit(1).offset(10).correlate(User).as_scalar()

q = s.query(User).join(
    Messages,
    and_(User.id == Messages.user_id, Messages.date > subq)
).options(contains_eager(User.messages))

run_test(q)

# version 2.  LATERAL

s.close()

subq = s.query(Messages).\
    filter(Messages.user_id == User.id).\
    order_by(Messages.date.desc()).limit(10).subquery().lateral()

q = s.query(User).outerjoin(subq).\
    options(contains_eager(User.messages, alias=subq))

run_test(q)
Axillary answered 2/5, 2017 at 6:43 Comment(3)
This is awesome, Im working through this in my code now. I do feel that it may actually not be correct. Does this only work because your message dates are the same across all users? The subquery looks like it gives 1 date back offset by 10 from a DESC list. All your messages fit into that block, and therefore come back from the outer query. Im still experimenting with mixed resultsYorick
My best guess here is that if you have messages that fall within the date range, it will include all the users/messages that match. If there are no messages newer than the most recent, then it will not return any, or less that there desired 50 (since there wont be that many to retrieve).Yorick
ok .. this is some complex shazzz .. Thank you again. Appears there is some magic in there that does the look based on the messages date per user. Thats awesome. Where its tripping me up if the offset is larger than the number of messages for that user, it seems to come back empty. There may be a way to do an or clause or something .. fun saturdayYorick
F
0

If you apply limit and then call .all() on it, you will get all objects once and it will not get objects one by one , causing performance issue that you mentioned.

simply apply limit and get all objects.

users = User.query.limit(50).all()
print(len(users))
>>50

Or for child objects / relationships

user = User.query.one()
all_messages = user.messages.limit(10).all()


users = User.query.all()
messages = {}
for user in users:
    messages[user.id] = user.messages.limit(10).all()
Fetich answered 1/5, 2017 at 23:2 Comment(5)
Sorry for the confusion. I do not want to limit the number of users. I want to limit the number of messages in user.messages (a collection child to users).Axillary
Same applies to lazy child collections / relationshipsFetich
To be more clear, I want to get all() users, and on each individual object returned I want the number of messages limited to 10 (for example).Axillary
Sorry for the continued confusion. I have updated my question to be more clear.Axillary
You don't need to get all user objects one by one to apply limit on child items. simply get all users and then apply limit on all messages. You can't get 10 message with user query. You have to iterateFetich
T
0

So, I think you'll need to load the messages in a second query and then later associate with your users somehow. The following is database dependent; as discussed in this question, mysql does not support in queries with limits, but sqlite at least will parse the query. I didn't look at the plan to see if it did a good job. The following code will find all the message objects you care about. You then need to associate them with users.
I've tested this to confirm that it produces a query sqlite can parse; I have not confirmed that sqlite or any other database does the right thing with this query. I had to cheat a bit and use the text primitive to refer to the outer user.id column in the select because SQLAlchemy kept wanting to include an additional join to users in the inner select subjquery.

from sqlalchemy import Column, Integer, String, ForeignKey, alias
from sqlalchemy.sql import text

from sqlalchemy.orm import Session
from sqlalchemy.ext.declarative import declarative_base

Base = declarative_base()

class User(Base):
    __tablename__ = 'users'
    id = Column(Integer, primary_key = True)
    name = Column(String)

class Message(Base):
    __tablename__ = 'messages'
    user_id = Column(Integer, ForeignKey(User.id), nullable = False)
    id = Column(Integer, primary_key = True)


s = Session()
m1 = alias(Message.__table__)

user_query = s.query(User) # add any user filtering you want
inner_query = s.query(m1.c.id).filter(m1.c.user_id == text('users.id')).limit(10)
all_messages_you_want = s.query(Message).join(User).filter(Message.id.in_(inner_query))

To associate the messages with users, you could do something like the following assuming your Message has a user relation and your user objects have a got_child_message method that does whatever you like for this

users_resulting = user_query.all() #load objects into session and hold a reference
for m in all_messages_you_want: m.user.got_child_message(m)

Because you already have the users in the session and because the relation is on User's primary key, m.user resolves to query.get against the identity map. I hope this helps you get somewhere.

Tananarive answered 2/5, 2017 at 0:45 Comment(0)
Y
0

@melchoirs answer is the best. I basically putting this here for futureselves

I played around with the above stated answer, and it works, I needed it more so to limit the number of associations returned before passing into a Marshmallow Serializer.

Some issues for clarification:

  • the subquery runs per association, hence it finds the corresponding date to base off properly
  • think about the limit/offset as give me 1 (limit) record starting at the next X (offset). Hence what is the Xth oldest record, and then in the main query it gives everything back from that. Its damn smart
  • It appears that if the association has less than X records, it returns nothing, as the offset is past the records, and henceforth the main query does not return a record.

Using the above as a template, I came up with the below answer. The initial query/count guard is due to the issue that if the associated records are less than the offset, nothing is found. In addition, I needed to add an outerjoin in the event that there are no associations either.

At the end, I found this query to be a bit or ORM voodoo, and didn't want to go that route. I instead exclude the histories from the device serializer, and require a second history lookup using the device ID. That set can be paginated and makes everything a bit cleaner.

Both methods work, it just comes down to the why you'll need to do the one query versus a couple. In the above, there was probably business reasons to get everythng back more efficiently with the single query. For my use case, readability, and convention trumped the voodoo

@classmethod
    def get_limited_histories(cls, uuid, limit=10):

        count = DeviceHistory.query.filter(DeviceHistory.device_id == uuid).count()

        if count > limit:
            sq = db.session.query(DeviceHistory.created_at) \
                .filter(DeviceHistory.device_id == Device.uuid) \
                .order_by(DeviceHistory.created_at.desc()) \
                .limit(1).offset(limit).correlate(Device)


        return db.session.query(Device).filter(Device.uuid == uuid) \
                .outerjoin(DeviceHistory,
                    and_(DeviceHistory.device_id == Device.uuid, DeviceHistory.created_at > sq)) \
                .options(contains_eager(Device.device_histories)).all()[0]

It then behaves similar to a Device.query.get(id) but Device.get_limited_histories(id)

  • ENJOY
Yorick answered 8/9, 2019 at 14:41 Comment(0)

© 2022 - 2024 — McMap. All rights reserved.