Rails - Distinct ON after a join

I am using Rails 4.2 with PostgreSQL. I have a Product model and a Purchase model, where a Product has many Purchases. I want to find the distinct products, ordered by how recently they were purchased. Initially I tried:

Product.joins(:purchases)
.select("DISTINCT products.*, purchases.updated_at") #postgresql requires order column in select
.order("purchases.updated_at DESC")

This however results in duplicates, because DISTINCT applies to the whole row: every distinct pair (products.id, purchases.updated_at) is kept. I just want the products with a distinct id after the join: if a product id appears multiple times in the join, select only the first one. So I also tried:

Product.joins(:purchases)
.select("DISTINCT ON (products.id) purchases.updated_at, products.*")
.order("products.id, purchases.updated_at") #postgres requires that the DISTINCT ON expressions match the leftmost ORDER BY expressions

This doesn't work either: because of this constraint I have to put the product id first in the ORDER BY clause, so the results come back ordered by product id instead of by purchase date.
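
To make the two failed attempts concrete, here is a minimal plain-Ruby sketch of what each query does to the joined rows. The row data, the integer timestamps and the method names are made up for illustration; this is not ActiveRecord code, only a model of the SQL semantics.

```ruby
# Hypothetical result of joining products to purchases: one row per purchase.
# Integers stand in for timestamps.
JOINED_ROWS = [
  { product_id: 1, updated_at: 3 },
  { product_id: 2, updated_at: 9 },
  { product_id: 1, updated_at: 7 }
].freeze

# First attempt: plain DISTINCT compares whole rows, so both purchases
# of product 1 survive.
def distinct_rows(rows)
  rows.uniq
end

# Second attempt: DISTINCT ON (id) with ORDER BY id, updated_at yields one
# row per product, but the result is ordered by id, not by purchase time.
def distinct_on_product(rows)
  rows
    .sort_by { |row| [row[:product_id], row[:updated_at]] }
    .uniq { |row| row[:product_id] }
end

p distinct_rows(JOINED_ROWS).length                               # 3
p distinct_on_product(JOINED_ROWS).map { |row| row[:product_id] } # [1, 2]
```

The order I am after for this data would be product 2 first, since it has the most recent purchase.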

What is the rails way to achieve this?

Westmoreland answered 25/9, 2015 at 5:9 Comment(0)
W
3

Use a subquery and add a different ORDER BY clause in the outer SELECT:

SELECT *
FROM  (
   SELECT DISTINCT ON (pr.id)
          pu.updated_at, pr.*
   FROM   Product pr
   JOIN   Purchases pu ON pu.product_id = pr.id  -- guessing
   ORDER  BY pr.id, pu.updated_at DESC NULLS LAST
   ) sub
ORDER  BY updated_at DESC NULLS LAST;
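
The two sorting steps can be modelled in plain Ruby. This is only a sketch of the semantics, not database code: the rows, the integer timestamps and the method names are assumptions for illustration, and NULL timestamps are not handled.

```ruby
# Hypothetical joined rows; integers stand in for timestamps.
PURCHASE_ROWS = [
  { product_id: 1, updated_at: 3 },
  { product_id: 2, updated_at: 9 },
  { product_id: 1, updated_at: 7 },
  { product_id: 3, updated_at: 1 },
  { product_id: 2, updated_at: 5 }
].freeze

# Inner query: ORDER BY product_id, updated_at DESC, then keep the first
# row of each product_id group, which is what DISTINCT ON (pr.id) does.
def latest_per_product(rows)
  rows
    .sort_by { |row| [row[:product_id], -row[:updated_at]] }
    .uniq { |row| row[:product_id] }
end

# Without the inner sort, the kept row is simply the first one encountered.
# (Ruby is deterministic here; Postgres makes no promise at all.)
def first_seen_per_product(rows)
  rows.uniq { |row| row[:product_id] }
end

# Outer query: re-sort the one-row-per-product result by recency.
def recently_purchased(rows)
  latest_per_product(rows).sort_by { |row| -row[:updated_at] }
end

p recently_purchased(PURCHASE_ROWS).map { |row| row[:product_id] }     # [2, 1, 3]
p first_seen_per_product(PURCHASE_ROWS).map { |row| row[:updated_at] } # [3, 9, 1]
```

In the second output, product 1 is represented by its older purchase (3 instead of 7), which is the kind of result the inner ORDER BY is there to rule out.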

DISTINCT ON keeps the first row of each group in ORDER BY order, which is why the leading ORDER BY expressions have to match the DISTINCT ON expressions.

But if all you need from Purchases is updated_at, you can get this cheaper with a simple aggregate in a subquery before you join:

SELECT *
FROM   Product pr
JOIN  (
   SELECT product_id, max(updated_at) AS updated_at
   FROM   Purchases 
   GROUP  BY 1
   ) pu ON pu.product_id = pr.id  -- guessing
ORDER  BY pu.updated_at DESC NULLS LAST;
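
The aggregate-then-join idea can be sketched in plain Ruby as well. The sample products, the integer timestamps and the method names below are hypothetical; the point is only to show the shape of the computation: reduce purchases to one timestamp per product first, then join.

```ruby
# Hypothetical tables; integers stand in for timestamps.
PRODUCTS = [
  { id: 1, name: "kettle" },
  { id: 2, name: "toaster" },
  { id: 3, name: "blender" } # never purchased
].freeze

PURCHASES = [
  { product_id: 1, updated_at: 3 },
  { product_id: 2, updated_at: 9 },
  { product_id: 1, updated_at: 7 }
].freeze

# Subquery: SELECT product_id, max(updated_at) ... GROUP BY 1
def latest_purchase_by_product(purchases)
  purchases
    .group_by { |purchase| purchase[:product_id] }
    .transform_values { |rows| rows.map { |row| row[:updated_at] }.max }
end

# Inner join on the product id, then ORDER BY the aggregated timestamp DESC.
def products_by_recent_purchase(products, purchases)
  latest = latest_purchase_by_product(purchases)
  products
    .select { |product| latest.key?(product[:id]) }
    .map { |product| product.merge(updated_at: latest[product[:id]]) }
    .sort_by { |product| -product[:updated_at] }
end

p products_by_recent_purchase(PRODUCTS, PURCHASES).map { |product| product[:name] }
# ["toaster", "kettle"]
```

As with the inner join in the SQL, a product without purchases (the blender here) drops out of the result.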

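About NULLS LAST: in PostgreSQL a descending sort puts NULL values first by default, so NULLS LAST is needed to keep rows without a timestamp at the end.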
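
The difference between the two orderings can be illustrated in plain Ruby, with nil standing in for NULL. The sample values and method names are made up; this only models the ordering, it does not talk to a database.

```ruby
# Hypothetical timestamps; nil stands in for NULL.
TIMESTAMPS = [5, nil, 9, 2].freeze

# PostgreSQL's default for ORDER BY ... DESC: NULLs sort first.
def desc_nulls_first(values)
  nils, present = values.partition(&:nil?)
  nils + present.sort.reverse
end

# ORDER BY ... DESC NULLS LAST: NULLs are moved to the end.
def desc_nulls_last(values)
  nils, present = values.partition(&:nil?)
  present.sort.reverse + nils
end

p desc_nulls_first(TIMESTAMPS) # [nil, 9, 5, 2]
p desc_nulls_last(TIMESTAMPS)  # [9, 5, 2, nil]
```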

Or even simpler, but not as fast while retrieving all rows:

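SELECT pr.*, max(pu.updated_at) AS updated_at  -- qualified: both tables may have updated_at
FROM   Product pr
JOIN   Purchases pu ON pu.product_id = pr.id
GROUP  BY pr.id  -- must be primary key
ORDER  BY max(pu.updated_at) DESC NULLS LAST;  -- not "ORDER BY 2": pr.* expands to several columns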

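Product.id needs to be defined as primary key for this to work: Postgres (9.1 or later) then treats all other columns of the table as functionally dependent on it, so pr.* is allowed in the SELECT list without listing every column in GROUP BY.

If you fetch only a small selection (with a WHERE clause restricting to just one or a few pr.id for instance), this last query will be the faster one.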

Whitewing answered 25/9, 2015 at 5:27 Comment(5)
Is there a rails way of doing this? - Westmoreland
@zack: There sure is. Should be easy to translate, but I am no Rails expert. Ultimately, the code has to be translated to SQL before being sent to the DB server. - Whitewing
@zack, well, you'll likely have to use SQL strings. The SQL above uses features that, I believe, Arel doesn't support. So you'll have to use things like .select('DISTINCT ON (... and .order('whatever DESC NULLS FIRST'). Or perhaps you could hack on Arel to support these modifiers, but that would take quite a bit of time. Arel is under the hood of ActiveRecord in Rails. - Collocutor
@Collocutor I figured out the way to do it. Please check my answer. - Westmoreland
@zack: I added another alternative. - Whitewing
W
5

Building on @ErwinBrandstetter's answer, I finally found the right way of doing this. The query to find distinct recent purchases is:

SELECT *
FROM  (
   SELECT DISTINCT ON (pr.id)
          pu.updated_at, pr.*
   FROM   Product pr
   JOIN   Purchases pu ON pu.product_id = pr.id
   ORDER  BY pr.id, pu.updated_at DESC NULLS LAST
   ) sub
ORDER  BY updated_at DESC NULLS LAST;

The ORDER BY inside the subquery makes DISTINCT ON keep the latest purchase for each product; the outer ORDER BY only sorts the final result.

The Rails way of doing this is:

inner_query = Product.joins(:purchases)
  .select("DISTINCT ON (products.id) products.*, purchases.updated_at AS date")
  .order("products.id, purchases.updated_at DESC NULLS LAST") # one row per purchased product: its latest purchase

result = Product.from("(#{inner_query.to_sql}) AS unique_purchases")
  .select("unique_purchases.*").order("unique_purchases.date DESC NULLS LAST")

The second (and better) way to do this as suggested by @ErwinBrandstetter is

SELECT *
FROM   Product pr
JOIN  (
   SELECT product_id, max(updated_at) AS updated_at
   FROM   Purchases 
   GROUP  BY 1
   ) pu ON pu.product_id = pr.id
ORDER  BY pu.updated_at DESC NULLS LAST;

which can be written in Rails as

join_query = Purchase.select("product_id, max(updated_at) AS date")
  .group(:product_id) # most recent purchase date for each purchased product

result = Product.joins("INNER JOIN (#{join_query.to_sql}) AS unique_purchases ON products.id = unique_purchases.product_id")
  .order("unique_purchases.date DESC NULLS LAST")
Westmoreland answered 25/9, 2015 at 17:21 Comment(1)
As for the 1st query: "The order_by isn't needed inside the subquery." While this typically works, it's just an implementation detail that can break any time. You need ORDER BY in the subquery to guarantee the latest row. Postgres is free to return any row for each pr.id otherwise. - Whitewing
E
2

To build on erwin-brandstetter's answer, this is how you could do this with ActiveRecord (should be close at least):

Product
  .select('*')
  .joins('INNER JOIN (SELECT product_id, max(updated_at) AS updated_at FROM purchases GROUP BY 1) pu ON pu.product_id = products.id') # no "pr" alias exists here, so refer to the products table directly
  .order('pu.updated_at DESC NULLS LAST')
Exile answered 25/9, 2015 at 15:46 Comment(2)
should be .joins('INNER JOIN (SELECT product_id ... ). Check my answer. - Westmoreland
Thanks zack. I've updated my answer. Did that work as you intended? - Exile
F
1

Try to do this:

Product.joins(:purchases)
.select("DISTINCT ON (products.id) purchases.product_id, purchases.updated_at, products.*")
.order("products.id, purchases.updated_at DESC") #postgres requires that the DISTINCT ON expressions match the leftmost ORDER BY expressions
Frisbee answered 25/9, 2015 at 7:30 Comment(5)
This would result in duplicates. A product id might appear multiple times in the purchases table, each time with a different updated_at value. I want to select the product just once. - Westmoreland
How is grouping by products.id different from purchases.product_id? - Westmoreland
Sorry about that. Both of them are equal. I have updated again based on your code. - Frisbee
This is the same as what I have in my question. - Westmoreland
Product.joins(:purchases).select("DISTINCT ON (products.id) purchases.product_id, purchases.updated_at, products.*").order("product_id, purchases.updated_at") - Randi
W
0

I ended up with this -

Product.joins(:purchases)
.select("DISTINCT ON (products.id) products.*, purchases.updated_at as date")
.sort_by(&:date)
.reverse

Still looking for a better way to do this.

Westmoreland answered 25/9, 2015 at 12:8 Comment(4)
Half of the job is done by Ruby here, which is much less than ideal. - Collocutor
@Collocutor See the other answer: #32775720 - Westmoreland
Whatever. I said you'd have to use SQL strings and you actually do =) - Collocutor
@Collocutor Well, I don't see a better way since they are not supported by Rails. - Westmoreland
