PostgreSQL DISTINCT ON with different ORDER BY

Asked 20/3, 2012 at 21:59 Answered 27/11, 2020 at 10:15

Solved sql postgresql sql-order-by distinct-on

339

I want to run this query:

SELECT DISTINCT ON (address_id) purchases.address_id, purchases.*
FROM purchases
WHERE purchases.product_id = 1
ORDER BY purchases.purchased_at DESC

But I get this error:

PG::Error: ERROR: SELECT DISTINCT ON expressions must match initial ORDER BY expressions

Adding address_id as first ORDER BY expression silences the error, but I really don't want to add sorting over address_id. Is it possible to do without ordering by address_id?

Bim answered 20/3, 2012 at 21:59 Comment(5)

Your order clause has purchased_at not address_id.Can you make your question clear. – Hoisch 20/3, 2012 at 22:1

my order has purchase because i want it, but postgres also asks for address(see error message). – Bim 20/3, 2012 at 22:3

Fully answered here - #9796578 Thanks to stackoverflow.com/users/268273/mosty-mostacho – Bim 21/12, 2012 at 23:40

Personally I think requiring DISTINCT ON to match ORDER BY is very questionable, as there are a variety of legitimate use cases for having them differ. There is a post on postgresql.uservoice trying to change this for those who feel similarly. postgresql.uservoice.com/forums/21853-general/suggestions/… – Endoergic 16/7, 2019 at 21:37

got the exact same issue, and facing the same limtation. At the moment I have broken it into a sub-query and then ordering, but it feels dirty. – Foochow 24/10, 2019 at 6:30

315

Documentation says:

DISTINCT ON ( expression [, ...] ) keeps only the first row of each set of rows where the given expressions evaluate to equal. [...] Note that the "first row" of each set is unpredictable unless ORDER BY is used to ensure that the desired row appears first. [...] The DISTINCT ON expression(s) must match the leftmost ORDER BY expression(s).

Official documentation

So you'll have to add the address_id to the order by.

Alternatively, if you're looking for the full row that contains the most recent purchased product for each address_id and that result sorted by purchased_at then you're trying to solve a greatest N per group problem which can be solved by the following approaches:

The general solution that should work in most DBMSs:

SELECT t1.* FROM purchases t1
JOIN (
    SELECT address_id, max(purchased_at) max_purchased_at
    FROM purchases
    WHERE product_id = 1
    GROUP BY address_id
) t2
ON t1.address_id = t2.address_id AND t1.purchased_at = t2.max_purchased_at
ORDER BY t1.purchased_at DESC

A more PostgreSQL-oriented solution based on @hkf's answer:

SELECT * FROM (
  SELECT DISTINCT ON (address_id) *
  FROM purchases 
  WHERE product_id = 1
  ORDER BY address_id, purchased_at DESC
) t
ORDER BY purchased_at DESC

Problem clarified, extended and solved here: Selecting rows ordered by some column and distinct on another

Gunstock answered 20/3, 2012 at 22:8 Comment(12)

It works, but gives wrong ordering. That's why i want to get rid of address_id in order clause – Bim 20/3, 2012 at 22:12

But may be there is another way to select latest purchases for disticnt addresses? – Bim 20/3, 2012 at 22:19

You have to order over address_id, but this works if you order by purchases.purchased_at DESC because for each address, it grabs the most recent purchase. – Topper 2/10, 2014 at 15:0

If you need to order by purchases.purchased_at, you can add purchased_at to your DISTINCT conditions: SELECT DISTINCT ON (purchases.purchased_at, address_id). However, two records with the same address_id but different purchased_at values will result in duplicates in the returned set. Make sure you are cognizant of the data you're querying. – Overlord 26/1, 2015 at 21:44

The spirit of the question is clear. No need to pick on semantics. It's sad that the accepted and most voted answer doesn't help you solve the problem. – Krever 7/4, 2017 at 22:39

This query, though accepted, does not produce the desired ordering in the result (with ORDER BY purchases.purchased_at DESC). – Odaniel 13/7, 2017 at 0:5

Your fix is not based on hkf's answer, which is incorrect and misleading. It happens to be an exact copy of what I posted 5 years ago. – Odaniel 17/7, 2017 at 15:19

@ErwinBrandstetter I did base it on hkf's answer. The issue is that your answer happens to be the same as hkf's (except for the ordering, of course). Regarding the ordering, you have suggested me to use the improvement for the ORDER BY in your comment above, so I improved the answer with your own feedback. I'm not sure what's the matter with that. Anyway, feel free to edit my answer as you deem appropriate – Gunstock 17/7, 2017 at 19:49

@MostyMostacho: It's not the same, and it's not just the ordering. That's the point. – Odaniel 18/7, 2017 at 2:47

Here is a postgresql.uservoice post trying to lift this limitation for those that agree it is a questionable limitation. postgresql.uservoice.com/forums/21853-general/suggestions/… – Endoergic 16/7, 2019 at 21:40

as someone suggested in the other answer, the more generic database solution will return two rows with the same address_id for the same purchased_at – Viper 4/9, 2020 at 23:59

"The DISTINCT ON expression(s) must match the leftmost ORDER BY expression" But why though? It should suffice to order on purchased_at since that will put the desired row at the top of the group. – Headway 3/1 at 13:36

A subquery can solve it:

SELECT *
FROM  (
    SELECT DISTINCT ON (address_id) *
    FROM   purchases
    WHERE  product_id = 1
    ) p
ORDER  BY purchased_at DESC;

Leading expressions in ORDER BY have to agree with columns in DISTINCT ON, so you can't order by different columns in the same SELECT.

Only use an additional ORDER BY in the subquery if you want to pick a particular row from each set:

SELECT *
FROM  (
    SELECT DISTINCT ON (address_id) *
    FROM   purchases
    WHERE  product_id = 1
    ORDER  BY address_id, purchased_at DESC  -- get "latest" row per address_id
    ) p
ORDER  BY purchased_at DESC;

If purchased_at can be NULL, use DESC NULLS LAST - and match your index for best performance. See:

Related, with more explanation:

Odaniel answered 20/3, 2012 at 22:40 Comment(5)

You cannot use DISTINCT ON without a matching ORDER BY. The first query requires an ORDER BY address_id inside the subquery. – Cullen 12/7, 2017 at 18:46

@AristotlePagaltzis: But you can. Wherever you got that from, it's incorrect. You can use DISTINCT ON without ORDER BY in the same query. You get an arbitrary row from each set of peers defined by the DISTINCT ON clause in this case. Try it or follow the links above for details and links to the manual. ORDER BY in the same query (the same SELECT) just cannot disagree with DISTINCT ON. I did explain that, too. – Odaniel 13/7, 2017 at 0:8

Huh, you’re right. I was blind to the implication of the “unpredictable unless ORDER BY is used” note in the docs because it does not make sense to me that the feature is implemented to be able deal with non-consecutive sets of values… yet won’t allow you to exploit that with an explicit ordering. Annoying. – Cullen 13/7, 2017 at 6:31

@AristotlePagaltzis: That's because, internally, Postgres uses one of (at least) two distinct algorithms: either traverse a sorted list or work with hash values - whichever promises to be faster. In the later case the result is not sorted by DISTINCT ON expressions (yet). – Odaniel 13/7, 2017 at 15:15

Thanks a lot! Your second query solved my issue and returns results in expected order! – Iatrics 8/5, 2020 at 18:28

You can order by address_id in an subquery, then order by what you want in an outer query.

SELECT * FROM 
    (SELECT DISTINCT ON (address_id) purchases.address_id, purchases.* 
    FROM "purchases" 
    WHERE "purchases"."product_id" = 1 ORDER BY address_id DESC ) 
ORDER BY purchased_at DESC

Overvalue answered 20/3, 2012 at 22:4 Comment(6)

But this will be slower than just one query, no? – Bim 20/3, 2012 at 22:5

Very marginally yes. Although since you have a purchases.* in your original select, I don't think this is production code? – Overvalue 20/3, 2012 at 22:6

I'd add that for newer versions of postgres you need to alias the subquery. For example: SELECT * FROM (SELECT DISTINCT ON (address_id) purchases.address_id, purchases.* FROM "purchases" WHERE "purchases"."product_id" = 1 ORDER BY address_id DESC ) AS tmp ORDER BY tmp.purchased_at DESC – Insatiate 17/6, 2014 at 20:38

This would return address_id twice (without need). Many clients have problems with duplicate column names. ORDER BY address_id DESC is pointless and misleading. It does nothing useful in this query. The result is an arbitrary pick from each set of rows with the same address_id, not the row with the latest purchased_at. The ambiguous question did not ask for that explicitly, but that's almost certainly the OP's intention. In short: do not use this query. I posted alternatives with explanation. – Odaniel 17/7, 2017 at 15:22

Worked for me. Great answer. – Clevie 24/4, 2020 at 23:31

this does not sort it by purchased_at in a subquery at all, it will return just randomly ordered rows of distinct addres_ids – Viper 5/9, 2020 at 0:10

Window function may solve that in one pass:

SELECT DISTINCT ON (address_id) 
   LAST_VALUE(purchases.address_id) OVER wnd AS address_id
FROM "purchases"
WHERE "purchases"."product_id" = 1
WINDOW wnd AS (
   PARTITION BY address_id ORDER BY purchases.purchased_at DESC
   ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING)

Wyckoff answered 18/7, 2013 at 6:38 Comment(2)

It would be nice if someone explained the query. – Taunton 29/4, 2017 at 10:18

@Gajus: Short explanation: it doesn't work, only returns distinct address_id. The principle could work, though. Related examples: https://mcmap.net/q/28544/-custom-aggregate-function-in-postgresql or https://mcmap.net/q/28545/-retrieve-aggregates-for-arbitrary-time-intervals. But there are shorter and / or faster queries for the problem at hand. – Odaniel 17/7, 2017 at 15:56

For anyone using Flask-SQLAlchemy, this worked for me

from app import db
from app.models import Purchases
from sqlalchemy.orm import aliased
from sqlalchemy import desc

stmt = Purchases.query.distinct(Purchases.address_id).subquery('purchases')
alias = aliased(Purchases, stmt)
distinct = db.session.query(alias)
distinct.order_by(desc(alias.purchased_at))

Jaunita answered 25/4, 2017 at 13:41 Comment(5)

Yes, or even easier, I was able to use: query.distinct(foo).from_self().order(bar) – Flowerlike 4/1, 2018 at 14:46

@LaurentMeyer do you mean Purchases.query? – Jaunita 8/1, 2018 at 13:24

Yes, I meant Purchases.query – Flowerlike 8/1, 2018 at 14:14

That's weird, is from_self something undocumented? I cannot get it to work – Query and QuerySet raise AttributeError when I try to access it – Mcleod 15/4, 2021 at 23:11

Oh, wait, it's SQLAlchemy, not Django, silly me – Mcleod 15/4, 2021 at 23:12

It can also be solved using the following query along with other answers.

WITH purchase_data AS (
        SELECT address_id, purchased_at, product_id,
                row_number() OVER (PARTITION BY address_id ORDER BY purchased_at DESC) AS row_number
        FROM purchases
        WHERE product_id = 1)
SELECT address_id, purchased_at, product_id
FROM purchase_data where row_number = 1

Replication answered 27/11, 2020 at 10:15 Comment(0)

-3

You can also done this by using group by clause

   SELECT purchases.address_id, purchases.* FROM "purchases"
    WHERE "purchases"."product_id" = 1 GROUP BY address_id,
purchases.purchased_at ORDER purchases.purchased_at DESC

Walling answered 23/5, 2017 at 4:35 Comment(1)

This is incorrect (unless purchases has only the two columns address_id and purchased_at). Because of GROUP BY, you will need to use an aggregate function to get the value of each column not used for grouping, so they values will all be coming from different rows of the group unless you go through ugly and inefficient gymnastics. This can be fixed only by using window functions rather than GROUP BY. – Cullen 12/7, 2017 at 18:10

Hot tags

Godot Unity Godot Help Programming Godot 4.X GUI GDScript 3D 2D Physics CSharp Godot 3.X VR XR Projects C++

Recommended topics

Hot tags