SQL join: selecting the last records in a one-to-many relationship
Asked Answered
G

13

478

Suppose I have a table of customers and a table of purchases. Each purchase belongs to one customer. I want to get a list of all customers along with their last purchase in one SELECT statement. What is the best practice? Any advice on building indexes?

Please use these table/column names in your answer:

  • customer: id, name
  • purchase: id, customer_id, item_id, date

And in more complicated situations, would it be (performance-wise) beneficial to denormalize the database by putting the last purchase into the customer table?

If the (purchase) id is guaranteed to be sorted by date, can the statements be simplified by using something like LIMIT 1?

Garwin answered 21/1, 2010 at 17:29 Comment(2)
Yes, it might be worth denormalizing (if it improves performance a lot, which you can only find out by testing both versions). But the downsides of denormalization are usually worth avoiding.Albuquerque
Related: jan.kneschke.de/projects/mysql/groupwise-maxBeard
G
678

This is an example of the greatest-n-per-group problem that has appeared regularly on StackOverflow.

Here's how I usually recommend solving it:

SELECT c.*, p1.*
FROM customer c
JOIN purchase p1 ON (c.id = p1.customer_id)
LEFT OUTER JOIN purchase p2 ON (c.id = p2.customer_id AND 
    (p1.date < p2.date OR (p1.date = p2.date AND p1.id < p2.id)))
WHERE p2.id IS NULL;

Explanation: given a row p1, there should be no row p2 with the same customer and a later date (or in the case of ties, a later id). When we find that to be true, then p1 is the most recent purchase for that customer.

Regarding indexes, I'd create a compound index in purchase over the columns (customer_id, date, id). That may allow the outer join to be done using a covering index. Be sure to test on your platform, because optimization is implementation-dependent. Use the features of your RDBMS to analyze the optimization plan. E.g. EXPLAIN on MySQL.


Some people use subqueries instead of the solution I show above, but I find my solution makes it easier to resolve ties.

Gujarati answered 21/1, 2010 at 17:35 Comment(25)
Favorably, in general. But that depends on the brand of database you use, and the quantity and distribution of data in your database. The only way to get a precise answer is for you to test both solutions against your data.Gujarati
If you want to include customers who never made a purchase, then change JOIN purchase p1 ON (c.id = p1.customer_id) to LEFT JOIN purchase p1 ON (c.id = p1.customer_id)Akbar
I've implemented this solution along with the LEFT JOIN change from @GordonM. My issue now, is what if i have 2 identical rows. Is there a way to limit this to just return 1 row (doesn't matter which one)? Great discussion btw.Sihunn
@russds, you need some unique column you can use to resolve the tie. It makes no sense to have two identical rows in a relational database.Gujarati
Example with subquery is more preferrable for speedNonexistence
What is the purpose of "WHERE p2.id IS NULL"?Midian
@clu: see explanation in my answer to #121887Gujarati
@Nonexistence which example? and how do you know?Monicamonie
this solution works only, if there are more than 1 purchase records. ist there is 1:1 link, it does NOT work. there it has to be "WHERE (p2.id IS NULL or p1.id=p2.id)Meatus
Why not do p1.date <= p2.date and get rid of the or in the outer join? I'm applying this to hive, where or isn't allowed in a join. The or seems redundant in any case.Undernourished
@Someguywhocodes, You should review operator precedence of boolean operations. A OR B AND C is not the same as (A OR B) AND C.Gujarati
worked well for me and this has the fastest performance compared to other examples on this thread for the data set i haveQuartz
@b.lit I believe the purpose of the "WHERE p2.id IS NULL" is to isolate the last record in the purchase table. When we reach the end of the table, p1 points to the last record and p2 points to the next record. The last record has no next record so the id of that record is null.Mcmurray
@Patrick, You are correct, but FYI you are replying to a comment from 2015, so if that user needed to know the answer, they probably have gotten it by now. :-)Gujarati
@BrunoJennrich I found out it only works if there is more then 1 record in the one to many relationship. I did the query and since many of mine only have 1 record it was missing a lotCatlaina
I tried your solution and found it ve-e-ery slow for large tables with billions of rows.Fabria
@Fabria Was any other solution not very slow for a table with billions of rows?Gujarati
@BillKarwin Unfortunately, no. Thank you, your solution works well for relatively small tables. For example, it worked well for tables with thousands of rows. But it starts to work very slow for tables with hundreds of millions rows. I work with Postgresql and use indexes, of course.Fabria
@Fabria Well one thing I've learned from a lot of performance consulting is that as the scale of your application gets larger, different solutions must be used. The guideline we used at Percona was that for every order of magnitude (10x) growth in traffic or data size, you need to reevaluate your architecture and decide if it's time to refactor it to some different implementation. There is hardly ever a "one size fits all" solution.Gujarati
What about performance between subquery or your query ?Person
@ZanyarJ.Ahmed #1313620Gujarati
Your query is very slow please refer this question: #77929611Milanmilanese
@Milanmilanese You referred me to a question, then you deleted that question.Gujarati
@BillKarwin The query you provided is using a self-join to find the latest record for each id in the table. Self-joins can be computationally expensive, especially when dealing with large datasets. To optimize the query and improve performance, you can consider using a window function like ROW_NUMBER() to rank the records within each id group and then filter out only the latest record for each id. I used your query for 5.000 Records it takes about 20 seconds !Milanmilanese
@Milanmilanese I invite you to read my answer here: #1313620Gujarati
G
196

You could also try doing this using a sub select

SELECT  c.*, p.*
FROM    customer c INNER JOIN
        (
            SELECT  customer_id,
                    MAX(date) MaxDate
            FROM    purchase
            GROUP BY customer_id
        ) MaxDates ON c.id = MaxDates.customer_id INNER JOIN
        purchase p ON   MaxDates.customer_id = p.customer_id
                    AND MaxDates.MaxDate = p.date

The select should join on all customers and their Last purchase date.

Guffaw answered 21/1, 2010 at 17:40 Comment(5)
@clu: Change the INNER JOIN to a LEFT OUTER JOIN.Carvey
Looks like this assumes there is only one purchase on that day. If there were two you would get two output rows for one customer, I think?Bonnell
Why can't we do without the last INNER JOIN ?Xenophobia
@IstiaqueAhmed - the last INNER JOIN takes that Max(date) value and ties it back to the source table. Without that join, the only information you would have from the purchase table are the date and the customer_id, but the query asks for all fields from the table.Sever
What if there are customers that don't have a purchase yet... does this remove them from the list because it is an inner join?Coparcener
T
58

Another approach would be to use a NOT EXISTS condition in your join condition to test for later purchases:

SELECT *
FROM customer c
LEFT JOIN purchase p ON (
       c.id = p.customer_id
   AND NOT EXISTS (
     SELECT 1 FROM purchase p1
     WHERE p1.customer_id = c.id
     AND p1.id > p.id
   )
)
Tsar answered 12/6, 2013 at 8:10 Comment(5)
Can you explain the AND NOT EXISTS part in easy words ?Xenophobia
The sub select just checks if there‘s a row with a higher id. You’ll only get a row in your result set, if none with higher id is found. That should be the unique highest one.Tsar
When Id is a uniqueidentifier (guid), this cannot be used.Monoplane
@andrew.fox: Correct. You definitely need a consistent way to establish which record was inserted last. This will fail with the ID, if your IDs are random and not incremental. In that case replace the > comparison of ids with some timestamp column, but be sure to put an index on that column for performance reasons.Tsar
@Monoplane Note, that the OP stated: "If the (purchase) id is guaranteed to be sorted by date..."Tsar
M
53

If you're using PostgreSQL you can use DISTINCT ON to find the first row in a group.

SELECT customer.*, purchase.*
FROM customer
JOIN (
   SELECT DISTINCT ON (customer_id) *
   FROM purchase
   ORDER BY customer_id, date DESC
) purchase ON purchase.customer_id = customer.id

PostgreSQL Docs - Distinct On

Note that the DISTINCT ON field(s) -- here customer_id -- must match the left most field(s) in the ORDER BY clause.

Caveat: This is a nonstandard clause.

Minatory answered 20/6, 2018 at 5:8 Comment(1)
how is it different from JOIN (SELECT * FROM purchase WHERE customer.customer_id=purchase.customer_id ORDER BY customer_id, date DESC LIMIT 1)Stoma
T
35

You haven't specified the database. If it is one that allows analytical functions it may be faster to use this approach than the GROUP BY one(definitely faster in Oracle, most likely faster in the late SQL Server editions, don't know about others).

Syntax in SQL Server would be:

SELECT c.*, p.*
FROM customer c INNER JOIN 
     (SELECT RANK() OVER (PARTITION BY customer_id ORDER BY date DESC) r, *
             FROM purchase) p
ON (c.id = p.customer_id)
WHERE p.r = 1
Teleost answered 21/1, 2010 at 18:23 Comment(3)
This is the wrong answer to the question because you are using "RANK()" instead of "ROW_NUMBER()". RANK will still give you the same problem of ties when two purchases have the exact same date. That's what the Ranking function does; if the top 2 match, they both get assigned the value of 1 and the 3rd record gets a value of 3. With Row_Number, there is no tie, it is unique for the entire partition.Faludi
Trying Bill Karwin's approach against Madalina's approach here, with execution plans enabled under sql server 2008 I found Bill Karwin's apprach had a query cost of 43% as opposed to Madalina's approach which used 57%- so despite the more elegant syntax of this answer, I would still favour Bill's version!Hagioscope
Yes, there must be ‘ROW_NUMBER()’ instead. However, nice work @MadalinaAutograft
R
34

I found this thread as a solution to my problem.

But when I tried them the performance was low. Bellow is my suggestion for better performance.

With MaxDates as (
SELECT  customer_id,
                MAX(date) MaxDate
        FROM    purchase
        GROUP BY customer_id
)

SELECT  c.*, M.*
FROM    customer c INNER JOIN
        MaxDates as M ON c.id = M.customer_id 

Hope this will be helpful.

Rajkot answered 23/10, 2014 at 6:29 Comment(6)
to get only 1 i used top 1 and ordered it by MaxDate descPileup
this is easy and straightforward solution, in MY case (many customers, few purchases) 10% faster then @Stefan Haberl's solution and more than 10 times better than accepted answerDecosta
Great suggestion using common table expressions (CTE) to solve this problem. This has dramatically improved the performance of queries in many situations.Jehial
Best answer imo, easy to read, the MAX() clause gives great performance comparted to ORDER BY + LIMIT 1Dihybrid
Wrong answer. It provides just latest date column from the table purchase. OP was asking for the whole recordRochet
Only provides the date not the whole recordCatlaina
C
13

Try this, It will help.

I have used this in my project.

SELECT 
*
FROM
customer c
OUTER APPLY(SELECT top 1 * FROM purchase pi 
WHERE pi.customer_id = c.Id order by pi.Id desc) AS [LastPurchasePrice]
Cannula answered 16/1, 2018 at 5:7 Comment(3)
Where do the alias "p" comes from?Ursal
this doesnt perform well.... took forever where other examples here took 2 seconds on the data set i have....Quartz
This was the most performant option for my data set.Modesta
G
7

I needed what you needed, albeit many years later, and tried the two most popular answers. These did not yield the desired fruit. So this is what I have to offer... For clarity, I changed some names.

SELECT 
  cc.pk_ID AS pk_Customer_ID, 
  cc.Customer_Name AS Customer_Name, 
  IFNULL(pp.pk_ID, '') AS fk_Purchase_ID,
  IFNULL(pp.fk_Customer_ID, '') AS fk_Customer_ID,
  IFNULL(pp.fk_Item_ID, '') AS fk_Item_ID,
  IFNULL(pp.Purchase_Date, '') AS Purchase_Date
FROM customer cc
LEFT JOIN purchase pp ON (
  SELECT zz.pk_ID 
  FROM purchase zz 
  WHERE cc.pk_ID = zz.fk_Customer_ID 
  ORDER BY zz.Purchase_Date DESC LIMIT 1) = pp.pk_ID
ORDER BY cc.pk_ID;
Grouping answered 23/3, 2021 at 22:7 Comment(2)
Thank you brother. This is working perfectlyJacqulynjactation
I have a condition where I have to join many tables and there are at 2 where I used one to many relationship. this actually solved my problemJacqulynjactation
W
7

On SQL Server you could use:

SELECT *
FROM customer c
INNER JOIN purchase p on c.id = p.customer_id
WHERE p.id = (
    SELECT TOP 1 p2.id
    FROM purchase p2
    WHERE p.customer_id = p2.customer_id
    ORDER BY date DESC
)

SQL Server Fiddle: http://sqlfiddle.com/#!18/262fd/2

On MySQL you could use:

SELECT c.name, date
FROM customer c
INNER JOIN purchase p on c.id = p.customer_id
WHERE p.id = (
    SELECT p2.id
    FROM purchase p2
    WHERE p.customer_id = p2.customer_id
    ORDER BY date DESC
    LIMIT 1
)

MySQL Fiddle: http://sqlfiddle.com/#!9/202613/7

Westward answered 8/4, 2021 at 20:41 Comment(0)
S
5

Tested on SQLite:

SELECT c.*, p.*, max(p.date)
FROM customer c
LEFT OUTER JOIN purchase p
ON c.id = p.customer_id
GROUP BY c.id

The max() aggregate function will make sure that the latest purchase is selected from each group (but assumes that the date column is in a format whereby max() gives the latest - which is normally the case). If you want to handle purchases with the same date then you can use max(p.date, p.id).

In terms of indexes, I would use an index on purchase with (customer_id, date, [any other purchase columns you want to return in your select]).

The LEFT OUTER JOIN (as opposed to INNER JOIN) will make sure that customers that have never made a purchase are also included.

Supposal answered 27/1, 2018 at 4:25 Comment(2)
wont run in t-sql as the select c.* has columns not in the group by clauseQuartz
I also find this works in SQLite. I scoured it documentation (which is extremely comprehensive) for some note saying that it should work but couldn't find anything. So there's no guarantee that it will work in future updates (unless you can find something I've missed).Dovecote
T
3

Tables :

Customer => id, name
Purchase => id, customer_id, item_id, date

Query :

SELECT C.id, C.name, P.id, P.date
  FROM customer AS C
  LEFT JOIN purchase AS P ON 
    (
      P.customer_id = C.id 
      AND P.id IN (
        SELECT MAX(PP.id) FROM purchase AS PP GROUP BY PP.customer_id
      )
    )

You can also specify some condition into sub select query

Trichromatic answered 13/8, 2021 at 16:47 Comment(0)
R
2

Please try this,

SELECT 
c.Id,
c.name,
(SELECT pi.price FROM purchase pi WHERE pi.Id = MAX(p.Id)) AS [LastPurchasePrice]
FROM customer c INNER JOIN purchase p 
ON c.Id = p.customerId 
GROUP BY c.Id,c.name;
Richburg answered 25/6, 2016 at 9:25 Comment(0)
D
1

Without getting into the code first, the logic/algorithm goes below:

  1. Go to the transaction table with multiple records for the same client.

  2. Select records of clientID and the latestDate of client's activity using group by clientID and max(transactionDate)

       select clientID, max(transactionDate) as latestDate 
       from transaction 
       group by clientID
    
  3. inner join the transaction table with the outcome from Step 2, then you will have the full records of the transaction table with only each client's latest record.

       select * from 
       transaction t 
       inner join (
         select clientID, max(transactionDate) as latestDate
         from transaction 
         group by clientID) d 
       on t.clientID = d.clientID and t.transactionDate = d.latestDate) 
    
  4. You can use the result from step 3 to join any table you want to get different results.

Decompress answered 5/10, 2020 at 20:17 Comment(0)

© 2022 - 2024 — McMap. All rights reserved.