How to limit contained associations per record/group?
Asked Answered
C

1

11

I have a Model, Articles, which hasMany Abstracts. I want to load the 10 latest Articles, and for each Article, the Abstract with the highest number of points. My function looks like this:

public function getArticles($category, $viewName) {
            $subArticles = $this->Articles->findByCategory($category)->contain([
                    'Abstracts' => function ($q) {
                            return $q
                                    ->select(['body', 'points', 'article_id'])
                                    ->where(['Abstracts.approved' => true])
                                    ->limit(10)
                                    ->order(['Abstracts.points' => 'DESC']);
                    }
            ])
            ->limit(10)
            ->order(['Articles.created' => 'DESC']) ;
            $this->set( $viewName . 'Articles', $subArticles );
    }

The result that I get is not what I intend though. Looking through the SQL, first CakePHP is getting the articles.id of everything in the category (fine). Then, CakePHP goes into the Abstracts table, using those 10 articles.id's it just found, and asks for the 10 Abstracts with the highest votes (that belong to those Articles).

The problem is that I want 1 Abstract for each Article, not the 10 Abstracts belonging to any Article in that category. How can I fix this? Thanks!

EDIT

ndm suggested that this was a duplicate of Using limit() on contained model so I attempted the solution there. Namely, I added this to my Model:

 $this->hasOne('TopAbstract', [
            'className' => 'Abstracts',
            'foreignKey' => 'abstract_id',
            'strategy' => 'select',
            'sort' => ['TopAbstract.points' => 'DESC'],
            'conditions' => function ($e, $query) {
            $query->limit(1);
            return $e;
    } ]);

And then I try to find the Articles byCategory, with contain(['TopAbstract']), only this kills my SQL. It dies a horrible death:

Error: SQLSTATE[HY000]: General error: 1 near ")": syntax error

Debug doesn't even show the query that killed it, so I'm not sure how to debug this one?

EDIT

Talking to myself a bit, but the error is definitely in the 'conditions' part of the hasOne. I take that out, and it works fine. Can't find an example of how this is supposed to look on the interwebs.. anyone have any idea?

Crossing answered 14/5, 2015 at 16:5 Comment(7)
possible duplicate of Using limit() on contained modelBasil
Yes, attempted the proposed solution there to no avail. Will edit my question to reflect the progress/continued failure :-(Crossing
So in other words, there is currently no solution to this problem? Hm.. seems like a pretty common problem that I have right?Crossing
I've deleted my previous comment as I figured there is more wrong. It seems that this will not work like that at all, as the select strategy will cause a single additional query only, and with the limit applied that will always be only a single result. I'm afraid the answer might be, or have become incorrect, or maybe it's just another bug, not sure. When not applying the limit, but only the order, it kinda works, but the order will be reversed, ie an ASC order will pick the result as if it was an DESC order and vice versa.Basil
This is all pretty wonky, and then there's the possible SQL compiler bug that causes the errror (it works if you for example return an empty array instead, $e is an "empty" QueryExpression instance with an AND conjunction), you might want to report this over at GitHub.Basil
Okay, so this method aside, is there another way to do what I'm after? This has to come up all the time with a top rated/modded comment, a best image, etc-- I'm not doing anything fancy here, should be a relatively straightforward way of doing it, right? What am I missingCrossing
Well, guess that depends on you definition of straightforward, but yes, there are other ways (like for example formatting/reducing/filtering results), I'll have to leave for dinner, but I can have a look later on.Basil
B
29

What you are looking for, is a solution to the problem. You didn't mention any specific RDBMS, but nonetheless see also http://dev.mysql.com/doc/refman/5.6/en/example-maximum-column-group-row.html

A library solution

For those who are a little bit adventurous, I've developed some custom associations that transparently integrate into the ORM layer, and allow for basic limit per group for hasMany and belongsToMany relations: https://github.com/icings/partitionable.

Using them, the solution for the question would be to set up an association like this:

$this
    ->partitionableHasMany('TopAbstracts')
    ->setClassName('Abstracts')
    ->setLimit(1)
    ->setSort([
        'Abstracts.points' => 'DESC',
        'Abstracts.id' => 'ASC',
    ]);

TopAbstracts could then be contained just like any other association.

Custom solutions on association level

So let's give this a try, here's three options that can be applied on association level (defining the conditions could also be moved into custom finders), however you might consider them as not that "straightforward".


Select strategy - Using a join on a grouping, max-value subquery

$this->hasOne('TopAbstracts', [
    'className' => 'Abstracts',
    'strategy' => 'select',
    'conditions' => function (\Cake\Database\Expression\QueryExpression $exp, \Cake\ORM\Query $query) {
        $query->innerJoin(
            [
                'AbstractsFilter' => $query
                    ->getConnection()
                    ->newQuery()
                    ->select(['article_id', 'points' => $query->func()->max('points')])
                    ->from('abstracts')
                    ->group('article_id')
            ],
            [
                'TopAbstracts.article_id = AbstractsFilter.article_id',
                'TopAbstracts.points = AbstractsFilter.points'
            ]
        );
        return [];
    }
]);

This will select the top abstracts via a join query that is based on the max points, it will look something like

SELECT
    TopAbstracts.id AS `TopAbstracts__id`, ...
FROM
    abstracts TopAbstracts
INNER JOIN (
        SELECT
            article_id, (MAX(points)) AS `points`
        FROM
            abstracts
        GROUP BY
            article_id
    )
    AbstractsFilter ON (
        TopAbstracts.article_id = AbstractsFilter.article_id
        AND
        TopAbstracts.points = AbstractsFilter.points
    )
WHERE
    TopAbstracts.article_id in (1,2,3,4,5,6,7,8, ...)

Select strategy - Using left self-join filtering

$this->hasOne('TopAbstracts', [
    'className' => 'Abstracts',
    'strategy' => 'select',
    'conditions' => function (\Cake\Database\Expression\QueryExpression $exp, \Cake\ORM\Query $query) {
        $query->leftJoin(
            ['AbstractsFilter' => 'abstracts'],
            [
                'TopAbstracts.article_id = AbstractsFilter.article_id',
                'TopAbstracts.points < AbstractsFilter.points'
            ]);
        return $exp->add(['AbstractsFilter.id IS NULL']);
    }
]);

This will use a self-join that filters based on the rows that don't have a.points < b.points, it will look something like

SELECT
    TopAbstracts.id AS `TopAbstracts__id`, ...
FROM 
    abstracts TopAbstracts
LEFT JOIN
    abstracts AbstractsFilter ON (
        TopAbstracts.article_id = AbstractsFilter.article_id
        AND
        TopAbstracts.points < AbstractsFilter.points
    )
WHERE
    (AbstractsFilter.id IS NULL AND TopAbstracts.article_id in (1,2,3,4,5,6,7,8, ...))

Join strategy - Using a subquery for the join condition

$this->hasOne('TopAbstracts', [
    'className' => 'Abstracts',
    'foreignKey' => false,
    'conditions' => function (\Cake\Database\Expression\QueryExpression $exp, \Cake\ORM\Query $query) {
        $subquery = $query
            ->getConnection()
            ->newQuery()
            ->select(['SubTopAbstracts.id'])
            ->from(['SubTopAbstracts' => 'abstracts'])
            ->where(['Articles.id = SubTopAbstracts.article_id'])
            ->order(['SubTopAbstracts.points' => 'DESC'])
            ->limit(1);

        return $exp->add(['TopAbstracts.id' => $subquery]);
    }
]);

This will use a correlated subquery that uses a rather specific select with simple ordering and limiting to pick the top comment. Note that the foreignKey option is set to false in order to avoid an additional Articles.id = TopAbstracts.article_id condition to be compiled into the join conditions.

The query will look something like

SELECT
    Articles.id AS `Articles__id`, ... ,
    TopAbstracts.id AS `TopAbstracts__id`, ...
FROM
    articles Articles
LEFT JOIN
    abstracts TopAbstracts ON (
        TopAbstracts.id = (
            SELECT
                SubTopAbstracts.id
            FROM
                abstracts SubTopAbstracts
            WHERE
                Articles.id = SubTopAbstracts.article_id
            ORDER BY
                SubTopAbstracts.points DESC
            LIMIT
                1
        )
    )

All these 3 options will query and inject the records without any hackery, it's just not very "straightforward".


A manual approach

For the sake of completeness, it is of course always possible to manually load the associcated records and format the results appropriately, for example using result formatters.

Using window functions

If your DBMS supports window functions, then you could query all associated records in a single additional query. For example:

$query = $this->Articles
    ->find()
    ->formatResults(function(\Cake\Collection\CollectionInterface $results) {
        // extract the article IDs from the results
        $articleIds = array_unique($results->extract('id')->toArray());

        // rank abstracts by points, partitioned by article
        $rankedAbstracts = $this->Articles->Abstracts
            ->find()
            ->select(function (\Cake\ORM\Query $query) {
                return [
                    'id' => 'id',
                    // as of CakePHP 4.1
                    'row_num' => $query
                        ->func()
                        ->rowNumber()
                        ->over()
                        ->partition(['Abstracts.article_id'])
                        ->order(['Abstracts.points' => 'DESC']),
                    // in earlier CakePHP versions instead
                    /*
                    'row_num' => $query->newExpr('
                        ROW_NUMBER() OVER (
                            PARTITION BY Abstracts.article_id
                            ORDER BY Abstracts.points DESC
                        )
                    '),
                    */
                ];
            })
            ->where([
                'Abstracts.article_id IN' => $articleIds,
            ]);

        // fetch top abstracts by ranking
        $topAbstracts = $this->Articles->Abstracts
            ->find()
            ->innerJoin(
                ['RankedAbstracts' => $rankedAbstracts],
                function (
                    \Cake\Database\Expression\QueryExpression $exp,
                    \Cake\ORM\Query $query
                ) {
                    return [
                        'RankedAbstracts.id' => $query->identifier(
                            'Abstracts.id'
                        ),
                        'RankedAbstracts.row_num' => 1,
                    ];
                }
            )
            ->all();

        // inject the associated records into the results
        return $results->map(function ($row) use ($topAbstracts) {
            $row['top_abstract'] = $topAbstracts
                ->filter(function ($value, $key) use ($row) {
                    return $value['article_id'] === $row['id'];
                })
                ->first();

            return $row;
        });
    })

Note that if one wanted to limit to more than 1 result, this could easily be achieved by testing for 'RankedAbstracts.row_num <=' => $limit instead, and injecting all matching records by using ->toList() instead of ->first().

The query will look something like:

SELECT
    Abstracts.id ...
FROM
  abstracts Abstracts 
INNER JOIN
    (
        SELECT
            id AS id,
            ROW_NUMBER() OVER (
                PARTITION BY Abstracts.article_id
                ORDER BY Abstracts.points DESC
            ) AS row_num 
        FROM
            abstracts Abstracts
        WHERE
            Abstracts.article_id IN (...)
    )
    RankedAbstracts ON
        RankedAbstracts.id = Abstracts.id AND
        RankedAbstracts.row_num = 1

See also

Fetch for each parent record individually

If your DBMS is outdated and doesn't support window functions, then another manual approach would be to issue an additional query for each article to fetch the associated top abstract (eg 100 articles would mean 100 additional queries!). Usually one would try to avoid doing this, as it can very quickly perform rather badly depending on the size of the result set. For example:

$query = $this->Articles
    ->find()
    ->formatResults(function(\Cake\Collection\CollectionInterface $results) {
        return $results->map(function ($row) {
            // query the top abstract for the current article
            // and inject it into the result
            $row['top_abstract'] = $this->Articles->Abstracts
                ->find()
                ->where(['Abstracts.article_id' => $row['id']])
                ->sort(['Abstracts.points' => 'DESC'])
                ->limit(1)
                ->all()
                ->first();

            return $row;
        });
    });

See also

Basil answered 16/5, 2015 at 0:45 Comment(5)
Thanks a lot! I ended up using the 3rd option, and worked great. Close enough to 'straightforward' for me :) thanks againCrossing
I have a similar situation at #32922940. Difference is that instead of a one-to-many between Article and Abstract. Mine is a many to many. How do I modify any of your 3 options to work for that?Sammysamoan
@Basil .Some days before i have faced this type of problem.You refered here Now it is really too much helpful to me. i got to know a new things ,its the hardest thing ,i have faced in every project . finally i ended with . thank you man. i would like to upvote you.Homiletic
I've tried the "Select strategy - Using a join on a grouping, max-value subquery" way but this apparently don't work when you have a table with composite primary key.Lapotin
@tiagoa Composite keys should generally work fine, you may want to open a new question where you can properly elaborate on the problem that you are seeing, and provide code examples that reproduce it.Basil

© 2022 - 2024 — McMap. All rights reserved.