Doing a market basket analysis on the order details
Asked Answered
D

2

7

I have a table that looks (abbreviated) like:

| order_id  | item_id   | amount    | qty   | date          |
|---------- |---------  |--------   |-----  |------------   |
| 1         | 1         | 10        | 1     | 10-10-2014    |
| 1         | 2         | 20        | 2     | 10-10-2014    |
| 2         | 1         | 10        | 1     | 10-12-2014    |
| 2         | 2         | 20        | 1     | 10-12-2014    |
| 2         | 3         | 45        | 1     | 10-12-2014    |
| 3         | 1         | 10        | 1     | 9-9-2014      |
| 3         | 3         | 45        | 1     | 9-9-2014      |
| 4         | 2         | 20        | 1     | 11-11-2014    |

I would like to run a query that would calculate the list of items that most frequently occur together.

In this case the result would be:

|items|frequency|
|-----|---------|
|1,2, |2        |
|1,3  |1        |
|2,3  |1        |
|2    |1        |

Ideally, first presenting orders with more than one items, then presenting the most frequently ordered single items.

Could anyone please provide an example for how to structure this SQL?

Derina answered 5/9, 2015 at 1:59 Comment(4)
Does date ordering matter?Hoagy
It should, if there is a tie.Derina
you have other softwares like R, SAS Enterprise Miner to do this more efficiently(as there are optimized algorithms written for you already). Writing a code/procedure in SQL is an overkill. This is because you can't always be sure of the number of combinations of different items (n-itemset) that would be bought together. And that would vary by day.Astir
What DB engine do you use?Rewrite
A
3

This query generate all of the requested output, in the cases where 2 items occur together. It doesn't include the last item of the requested output since a single value (2) technically doesn't occur together with anything... although you could easily add a UNION query to include values that happen alone.

This is written for PostgreSQL 9.3

 create table orders(
        order_id int, 
        item_id int, 
        amount int, 
        qty int, 
        date timestamp


);

INSERT INTO ORDERS VALUES(1,1,10,1,'10-10-2014');
INSERT INTO ORDERS VALUES(1,2,20,1,'10-10-2014');
INSERT INTO ORDERS VALUES(2,1,10,1,'10-12-2014');
INSERT INTO ORDERS VALUES(2,2,20,1,'10-12-2014');
INSERT INTO ORDERS VALUES(2,3,45,1,'10-12-2014');
INSERT INTO ORDERS VALUES(3,1,10,1,'9-9-2014');
INSERT INTO ORDERS VALUES(3,3,45,1,'9-9-2014');
INSERT INTO ORDERS VALUES(4,2,10,1,'11-11-2014');

with order_pairs as (
    select (pg1.item_id, pg2.item_id) as items, pg1.date
    from 
    (select distinct item_id, date
    from orders) as pg1
    join
    (select distinct item_id, date
    from orders) as pg2
    ON 
    (
    pg1.date = pg2.date AND
    pg1.item_id != pg2.item_id AND
    pg1.item_id < pg2.item_id

    )
    )

    SELECT items, count(*) as frequency
    FROM order_pairs
    GROUP by items
    ORDER by items;

output

 items | frequency 
-------+-----------
 (1,2) |         2
 (1,3) |         2
 (2,3) |         1
(3 rows)
Actinotherapy answered 5/9, 2015 at 3:25 Comment(2)
Any reason to use date instead of order_id?Fleuron
@sgarman, I think you're right. It should use order_id.Actinotherapy
B
0

Market Basket Analysis with Join. Join on order_id and compare if item_id < self.item_id. So for every item_id you get its associated items sold. And then group by items and count the number of rows for each combinations.

select items,count(*) as 'Freq' from 
(select concat(x.item_id,',',y.item_id) as items from orders x 
JOIN orders y ON x.order_id = y.order_id and 
x.item_id != y.item_id and x.item_id < y.item_id) A 
group by A.items order by A.items;
Beatific answered 7/2, 2017 at 17:2 Comment(0)

© 2022 - 2024 — McMap. All rights reserved.