SQLAlchemy - limit the joinedloaded results
Asked Answered
S

3

9

Models:

class Team(Base):
    id = Column(Integer, primary_key=True)
    name = Column(String, nullable=False)

    players = relationship("Player", backref="team")

class Player(Base):
    id = Column(Integer, primary_key=True)
    name = Column(String(255), unique=True)
    team_id = Column(Integer, ForeignKey("team.id"))
    positions = relationship("Position", backref="player")


class Position(Base):
    id = Column(Integer(), primary_key=True)
    name = Column(String(255), unique=True)
    player_id = Column(Integer, ForeignKey("player.id"))
    goals = relationship("Goal", backref="position")


class Goal(Base):
    id = Column(Integer(), primary_key=True)
    distance = Column(Integer)
    position_id = Column(Integer, ForeignKey("position.id"))

# Query to get all goals of all players of a team
query = (
    select(Team)
    .select_from(Player, Position, Goal)
    .options(joinedload(Team.players))
    .options(
        joinedload(
            Team.players,
            Player.positions,
        )
    )
    .options(
        joinedload(
            Team.players,
            Player.positions,
            Position.goals,
        )
    )
result = await db.execute(query)
response = result.scalar()

Sample json output from above query,


{
    "id": 3,
    "players": [
        {
            "id": 3,
            "positions": []
        },
        {
            "id": 5,
            "positions": [
                {
                    "id": 7,
                    "goals": [
                        {
                            "id": 13,
                        }
                    ]
                }
            ]
        },
        {
            "id": 1,
            "positions": [
                {
                    "id": 1,
                    "goals": [
                        {
                            "id": 16,
                        },
                        {
                            "id": 15,
                        },
                        {
                            "id": 14,
                        }
                    ]
                },
                {
                    "id": 2,
                    "goals": [
                        {
                            "id": 4,
                        }
                    ]
                }
            ]
        }
    ]
}

From sample json, we can clearly see that multiple goals are returned for a player id=1.

Now, I need to limit the query to last goal of each player, instead of all goals of that player.

So I tried,

subquery = (
    select(Goal)
    .order_by(Goal.id.desc())
    .limit(1)
    .subquery()
    .lateral()
)

query = (
    select(Team)
    .select_from(Player, Position, Goal)
    .options(joinedload(Team.players))
    .options(
        joinedload(
            Team.players,
            Player.positions,
        )
    )
    .outerjoin(subquery)
    .options(
        contains_eager(
            Team.players,
            Player.positions,
            Position.goals,
            alias=subquery,
        )
    )
result = await db.execute(query)
response = result.scalar()

Sample json output from above query

{
    "id": 3,
    "players": [
        {
            "id": 3,
            "positions": []
        },
        {
            "id": 5,
            "positions": [
                {
                    "id": 7,
                    "goals": [
                        {
                            "id": 16,
                        }
                    ]
                }
            ]
        },
        {
            "id": 1,
            "positions": [
                {
                    "id": 1,
                    "goals": [
                        {
                            "id": 16,
                        }
                    ]
                },
                {
                    "id": 2,
                    "goals": [
                         {
                            "id": 16,
                        }
                    ]
                }
            ]
        }
    ]
}

This fetches the last goal of any player, but not the last goal of respective player.

Filters like Goal.position_id == Position.id in outerjoin or subquery either does not work or result in error.

Edit:

It looks like I need populate_existing(), but it's not available in new select method.

Edit 2:

To simplify these queries, I am also thinking to create last_goal_id column in position table and update the position table to store id of last inserted goal. Are foreign keys to each other in 2 tables normal? goal would have position_id and position would have last_goal_id.

Sweeps answered 3/5, 2022 at 7:27 Comment(4)
What is the purpose of both the first query? Do you really want to pre-fetch everything for a Team from the database? And what is the purpose of modified query? Do you really need intermediary objects such as Position? Do you need it in the result, especially given that the last goal is not per Position, right?Suffix
@van, Yes, I need Position table data along with last goal of player. I have updated question to add sample response from both queries to help understand my issue.Sweeps
Does my answer gives the required output? If not, please add some sample data and the respective result (json) output that you want to achieve.Rostov
Thank you, @Jashwant. That clears it up. I have an idea for the solution, which I will post tomorrow.Suffix
S
7

Preamble

First, I believe that the line below should not be part of the query as it will create a cartesian product. Look for sqlalchemy warnings when executing the query:

.select_from(Player, Position, Goal)  # DELETE this as it creates cartesian product

Second, you can simplify your original query somewhat. Below produces a query equivalent to the one in your question:

# Query to get all goals of all players of a team
query1 = (
    select(Team)
    # .select_from(Player, Position, Goal)  # DELETE this as it creates cartesian product
    .options(
         joinedload(Team.players)
        .joinedload(Player.positions)
        .joinedload(Position.goals)
    )
)

contains_eager as alternative to joinedload

Above query can also be implemented differently by a) joining the related tables explicitly, and b) hinting to sqlalchemy that the query already contains the desired relationships:

query2 = (
    select(Team)
    .outerjoin(Team.players)
    .outerjoin(Player.positions)
    .outerjoin(Position.goals)
    .options(contains_eager(
        Team.players, 
        Player.positions, 
        Position.goals,
    ))  
)

SOLUTION:

Given that we can now be more explicit on the relationship join conditions, one way of implementing the query would be below:

# subquery to use in the join for getting only the last 1 goal for each Position
subq = (
    select(Goal.id.label("last_goal_id"))
    .filter(Goal.position_id == Position.id)
    .order_by(Goal.id.desc())
    .limit(1)
    .scalar_subquery()
    .correlate(Position)
)

query3 = (
    select(Team)
    .outerjoin(Team.players)
    .outerjoin(Player.positions)
    .outerjoin(Goal, Goal.id == subq)  # use the JOIN which includes ONLY last Goal, but ...
    .options(contains_eager(
        Team.players, 
        Player.positions, 
        Position.goals,  # ... tell sqlalchemy that we actually loaded ALL `.goals`
    ))
)

which produces the following SQL (sqlite):

SELECT goal.id,
       goal.distance,
       goal.position_id,
       position.id AS id_1,
       position.name,
       position.player_id,
       player.id AS id_2,
       player.name AS name_1,
       player.team_id,
       team.id AS id_3,
       team.name AS name_2
FROM team
LEFT OUTER JOIN player ON team.id = player.team_id
LEFT OUTER JOIN position ON player.id = position.player_id
LEFT OUTER JOIN goal ON goal.id =
  (SELECT goal.id AS last_goal_id
   FROM goal
   WHERE goal.position_id = position.id
   ORDER BY goal.id DESC
   LIMIT 1)

alternative SOLUTIONs:

What you could also do it to create a hybrid_property to have computed column which points to the last Goal.id per Position and use it to define a relationship which will only contain last Goal in the list:

class Position(Base):
    __tablename__ = "position"
    id = Column(Integer(), primary_key=True)
    name = Column(String(255), unique=True)
    player_id = Column(Integer, ForeignKey("player.id"))
    goals = relationship("Goal", backref="position")

    @hybrid_property
    def last_goal_id(self):
        ...

    @last_goal_id.expression
    def last_goal_id(cls):
        stmt = (
            select(Goal.id.label("last_goal_id"))
            # .filter(Goal.position_id == Position.id)
            .filter(Goal.position_id == cls.id)
            .order_by(Goal.id.desc())
            .limit(1)
            .scalar_subquery()
            .correlate(cls)
            # .correlate_except(Goal)
        )
        return stmt

    last_goals = relationship(
        lambda: Goal,
        primaryjoin=lambda: and_(
            Goal.position_id == Position.id,
            Goal.id == Position.last_goal_id,
        ),
        viewonly=True,
        uselist=True,
    )

In this case you could use the following query, but you should not navigate Position.goals relationship as it will load the whole list. Also the name of the json key will be different.

query1 = (
    select(Team)
    .options(
        joinedload(Team.players)
        .joinedload(Player.positions)
        .joinedload(Position.last_goals)  # use `.last_goals` instead of `.goals`
    )
)

NOTE: I personally like this the most as it is clean and explicit.

You could even mix the techniques to get both sides: use the .last_goals relationship, but trick SA to think it is the fully loaded .goals:

query2 = (
    select(Team)
    .outerjoin(Team.players)
    .outerjoin(Player.positions)
    .outerjoin(Position.last_goals)  # join via `.last_goals` relationship join, but ...
    .options(contains_eager(
        Team.players, 
        Player.positions, 
        Position.goals,  # ... tell sqlalchemy that we actually loaded `.goals`
    ))
)
Suffix answered 19/5, 2022 at 5:9 Comment(4)
Thanks for explaining along with the solution. It works perfectly. Can I also ask if the final sql query is the right way to go? Or should I reduce little bit complexity by storing last_goal_id in position table (Edit 2 in my question) ?Sweeps
@Jashwant, thanks for accepting my answer, I am glad you found it helpful. As to your follow up question: maintaining two-way relationship could be difficult and might lead to situations where items are out of sync as RDBMS does not support such relationship out of the box. However, there are ways to use computed columns to store such items either in the DB or as in-memory/view. I will expand the answer with the alternative solution to show what I mean.Suffix
Thanks for this answer, I never knew how to get last_goal from last_goal_id. After using this in my own code, I might as well mention how I set (my equivelant of) last_goal to automatically load without requiring any query options. I changed last_goal_id to a column_property, set it to correlate to Goal rather than cls (not sure if bug on your end or just different on mine), then added lazy='joined' on last_goal.Simultaneous
Hi @Peter. I did consider column_property which I do use quite a bit on my projects. But the decision depends on the context. Here I assumed that by default one would not want to load anything automatically, but only when specified in a query. Also for this particular question I was getting an error when running with column_property while hybrid_property worked well.Suffix
R
2

I think what you want can be achieved by using a DISTINCT ON clause to remove the duplicate rows retrieved from the Goal object:

query = (
    # Select from Goal and join all the required tables
    select(Goal)
    .join(Goal.position)
    .join(Position.player)
    .join(Player.team)
    # Remove duplicate rows based on the Player id
    .distinct(Player.id)   
    # Order by `Player.id` (required for distinct) and descending on the goal_id to have the latest added goals (newest) first
    .order_by(Player.id, Goal.id.desc())   
    )

When using the sample date below this results in:

{
    "id": 3,
    "players": [
        {
            "id": 5,
            "positions": [
                {
                    "id": 7,
                    "goals": [
                        {
                            "id": 13,
                        }
                    ]
                }
            ]
        },
        {
            "id": 1,
            "positions": [
                {
                    "id": 1,
                    "goals": [
                        {
                            "id": 16,
                        }
                    ]
                }
            ]
        }
    ]
}

I think the issue here is that Player 3 is missing as he has no goal.

You can also turn the query with the DISTINCT ON clause around which results in:

query = (
    # Select all the required tables
    select(Team, Player, Position, Goal)
    # outerjoin all required tables resulting in a `LEFT OUTER JOIN`
    .outerjoin(Team.players)
    .outerjoin(Player.positions)
    .outerjoin(Position.goals)
    # Remove duplicate rows based on the Player id
    .distinct(Player.id)
    # Order by `Player.id` (required for distinct) and descending on the goal_id to have the latest added goals (newest) first
    .order_by(Player.id, Goal.id.desc())
    )

As we are now starting from Team and going down the Player without any Goal is also included resulting in the following data:

{
    "id": 3,
    "players": [
        {
            "id": 3,
            "positions": []
        },
        {
            "id": 5,
            "positions": [
                {
                    "id": 7,
                    "goals": [
                        {
                            "id": 13,
                        }
                    ]
                }
            ]
        },
        {
            "id": 1,
            "positions": [
                {
                    "id": 1,
                    "goals": [
                        {
                            "id": 16,
                        }
                    ]
                }
            ]
        }
    ]
}

Sample data

{
    "id": 3,
    "players": [
        {
            "id": 3,
            "positions": []
        },
        {
            "id": 5,
            "positions": [
                {
                    "id": 7,
                    "goals": [
                        {
                            "id": 13,
                        }
                    ]
                }
            ]
        },
        {
            "id": 1,
            "positions": [
                {
                    "id": 1,
                    "goals": [
                        {
                            "id": 16,
                        },
                        {
                            "id": 15,
                        },
                        {
                            "id": 14,
                        }
                    ]
                },
                {
                    "id": 2,
                    "goals": [
                        {
                            "id": 4,
                        }
                    ]
                }
            ]
        }
    ]
}
Rostov answered 11/5, 2022 at 21:32 Comment(2)
this only outputs team data. No player, position, goals data.Sweeps
@Sweeps how did you test this? When running the query it outputs the following (direct copy from the console): [(Team(id=1, name='team3'), Player(id=1, name='player3'), None, None), (Team(id=1, name='team3'), Player(id=2, name='player5'), Position(id=1, name='position7'), Goal(id=1, distance=13)), (Team(id=1, name='team3'), Player(id=3, name='player1'), Position(id=3, name='position1'), Goal(id=5, distance=16))] This contains all your required data...Rostov
S
1

Have a look into using RANK, it may do what you need, though it would require a few queries/subqueries instead of one big joinedload.

I'd have a subquery to rank the goal dates and partition them by the positions or players, and filter it to where rank equals 1. That will get you the latest goal for each position, which you can create a dict for. With your main query, you can then use the position ID to find the latest goal using that dict.

Something like this:

# Rank goals by id and position
subquery = select(
    Goal.id.label('goal_id'),
    Goal.position_id,
    func.rank().over(order_by=Goal.id.desc(), partition_by(Goal.position_id)).label('rank'),
).subquery()

# Create dict of {position_id: latest_goal_id} to use as a lookup
latest_goal_query = (
    select(subquery.c.goal_id, subquery.c.position_id)
    .where(subquery.c.rank == 1)
)
latest_goal_ids = {pos_id: goal_id for goal_id, pos_id in session.execute(latest_goals).fetchall()}

# Get goal objects from the IDs
goal_query = select(Goal).where(Goal.id.in_(latest_goals.values()))
goals = {goal.id: goal for goal in session.execute(goal_query).scalars()}

# Map position ID to the latest goal object
latest_goals = {pos_id: goals[goal_id] for pos_id, goal_id in latest_goal_ids.items()}

# Read the team and position, and you can use the position_id to get the latest goal
query = ...

As a heads up btw - I used to attempt joinedload on everything until the author of SQLAlchemy told me that selectinload should be used when possible, because it fetches only the data you need, whereas joins may have a ton of duplicate data (eg. if your team has 20 players with 5 positions each and 20 goals each, then I think joining it all will result in each team name being sent 2000 times, and each player name being sent 100 times).


Edit: column_property just came to mind as an alternative solution. Unfortunately I've never been able to figure how to map the actual Goal model, so this isn't perfect, but here's an example of how you could add the ID of the latest goal directly to the Player model.

class Player(Base):
    ...
    latest_goal_id = column_property(
        select(Goal.id)
        .where(Goal.position.has(Position.player_id == id)),
        .order_by(Goal.id.desc()).limit(1)
    )

From the point of view of a query, it's just treated as another column, so you can select and filter by it.

Simultaneous answered 3/5, 2022 at 9:42 Comment(4)
I was hoping not to use many subqueries but fetch in just 1 big query. Thanks for the heads up. I have recently learned about selectinload when searching for answer to this problem. I will eventually switch to selectinload when this issue gets resolved.Sweeps
You can technically put this all as one query, but you've got to be careful you're not adding a ton of overhead, and it does also become a lot more complex. Having tried both ways over the past year or two, I prefer using separate queries as you can really easily optimise what data you're fetching, and it takes minutes to write, compared to my one "big query" that literally took an entire working day to come up with the first version :)Simultaneous
I'll keep your suggestion in mind. I have added a bounty on it for now to see if someone else can come up with a better solution.Sweeps
I did a quick edit on the post with an idea that just came to mind, it's probably not what you're after, but in case you're not aware of it, it's a useful feature to knowSimultaneous

© 2022 - 2024 — McMap. All rights reserved.