How to create a co-occurence matrix of product orders in python?
Asked Answered
H

3

9

Let's assume we have the following dataframe that includes customer orders (order_id) and the products that the individual order contained (product_id):

import pandas as pd

df = pd.DataFrame({'order_id' : [1, 1, 1, 1, 2, 2, 2, 3, 3, 3, 3],
                   'product_id' : [365, 48750, 3333, 9877, 48750, 32001, 3333, 3333, 365, 11202, 365]})
print(df)

   order_id product_id
0         1        365
1         1      48750
2         1       3333
3         1       9877
4         2      48750
5         2      32001
6         2       3333
7         3       3333
8         3        365
9         3      11202
10        3        365

It would be interesting to know how often product pairs appeared together in the same basket.

How does one create a co-occurence matrix in python that looks like this:

       365  48750  3333  9877  32001  11202
365      1      1     2     1      0      1
48750    1      0     2     1      1      0
3333     2      2     0     1      1      1
9877     1      1     1     0      0      0
32001    0      1     1     0      0      0
11202    1      0     1     0      0      0

I would be very thankful for your help!

Heyerdahl answered 14/11, 2020 at 18:18 Comment(2)
It would be helpful if you listed what you have tried so far.Khartoum
@Khartoum In R I could have done that, but I'm new to Python and didn't even know where to start. Sorry for the inconvenience.Heyerdahl
A
5

We start by grouping the df by order_id, and within each group calculate all possible pairs. Note we sort first by product_id so the same pairs in different groups are always in the same order

import itertools
all_pairs = []
for _, group in df.sort_values('product_id').groupby('order_id'):
    all_pairs += list(itertools.combinations(group['product_id'],2))

all_pairs

we get a list of all pairs from all orders

[('3333', '365'),
 ('3333', '48750'),
 ('3333', '9877'),
 ('365', '48750'),
 ('365', '9877'),
 ('48750', '9877'),
 ('32001', '3333'),
 ('32001', '48750'),
 ('3333', '48750'),
 ('11202', '3333'),
 ('11202', '365'),
 ('11202', '365'),
 ('3333', '365'),
 ('3333', '365'),
 ('365', '365')]

Now we count duplicates

from collections import Counter

count_dict = dict(Counter(all_pairs))
count_dict

so we get the count of each pair, basically what you are after

{('3333', '365'): 3,
 ('3333', '48750'): 2,
 ('3333', '9877'): 1,
 ('365', '48750'): 1,
 ('365', '9877'): 1,
 ('48750', '9877'): 1,
 ('32001', '3333'): 1,
 ('32001', '48750'): 1,
 ('11202', '3333'): 1,
 ('11202', '365'): 2,
 ('365', '365'): 1}

Putting this back into a cross-product table is a bit of work, the key bit is spliitng the tuples into columns by calling .apply(pd.Series) and eventually moving one of the columns to the column names via unstack:

(pd.DataFrame.from_dict(count_dict, orient='index')
    .reset_index(0)
    .set_index(0)['index']
    .apply(pd.Series)
    .rename(columns = {0:'pid1',1:'pid2'})
    .reset_index()
    .rename(columns = {0:'count'})
    .set_index(['pid1', 'pid2'] )
    .unstack()
    .fillna(0))

this produces a 'compact' form of the table you are after that only includes products that appeared in at least one pair


count
pid2    3333 365    48750  9877
pid1                
11202   1.0  2.0    0.0    0.0
32001   1.0  0.0    1.0    0.0
3333    0.0  3.0    2.0    1.0
365     0.0  1.0    1.0    1.0
48750   0.0  0.0    0.0    1.0

UPDATE Here is a rather simplified version of the above, following various discussions in the comments

import numpy as np
import pandas as pd
from collections import Counter

# we start as in the original solution but use permutations not combinations
all_pairs = []
for _, group in df.sort_values('product_id').groupby('order_id'):
    all_pairs += list(itertools.permutations(group['product_id'],2))
count_dict = dict(Counter(all_pairs))

# We create permutations for _all_ product_ids ... note we use unique() but also product(..) to allow for (365,265) combinations
total_pairs = list(itertools.product(df['product_id'].unique(),repeat = 2))

# pull out first and second elements separately
pid1 = [p[0] for p in total_pairs]
pid2 = [p[1] for p in total_pairs]

# and get the count for those permutations that exist from count_dict. Use 0
# for those that do not
count = [count_dict.get(p,0) for p in total_pairs]

# Now a bit of dataFrame magic
df_cross = pd.DataFrame({'pid1':pid1, 'pid2':pid2, 'count':count})
df_cross.set_index(['pid1','pid2']).unstack()

and we are done. df_cross below


count
pid2    11202   32001   3333    365 48750   9877
pid1                        
11202   0       0       1       2   0       0
32001   0       0       1       0   1       0
3333    1       1       0       3   2       1
365     2       0       3       2   1       1
48750   0       1       2       1   0       1
9877    0       0       1       1   1       0
Abba answered 14/11, 2020 at 19:24 Comment(8)
Clever solution!Histrionics
@Abba Thank you very much for your answer! I was able to make your approach work. The count_dict table is already really useful. Unfortunately the final table is a little confusing. For example the value (365, 3333) is 0, whereas (3333, 365) is the correct 3. Is there some way of creating an n x n Matrix, with n being the number of products, that is mirrored on the diagonal?Heyerdahl
glad it was of some help! And sorry I did not get the answer into the final shape. I will think about it some moreAbba
@Abba I was able to get there by using itertools.permutations instead of itertools.combinations, which makes sense if you think about it. It does count the 365 twice, unlike in my table, but that is actually a good thing, and you don't need to sort by product_id. Thank you very much again, you have helped me loads!Heyerdahl
Awesome, and thank you for letting me know, much appreciatedAbba
I have updated the answer with a solution that is much cleaner and builds on top of your permutations idea. I think I will delete the first, messy, part of if you don't mindAbba
@Abba Feel free to do soHeyerdahl
Although I just tested it, and your new solution is significantly more memory intensive, so you might want to leave the old one in or create a new commentHeyerdahl
H
0

This should be a good starting point and maybe can be useful

pd.crosstab(df['order_id '], df['product_id'])

product_id  365    3333   9877   11202  32001  48750
order_id 
1            1      1      1      0      0      1
2            0      1      0      0      1      1
3            2      1      0      1      0      0
Heresiarch answered 14/11, 2020 at 22:44 Comment(0)
N
0

Pivot so that each row corresponds to a product, then map each row to (df * row > 0).sum(1), which indicates number of orders in which the product co-occurs with each of the other products.

>>> df = df.pivot_table(index='product_id', columns='order_id', aggfunc='size')
>>> co_occ = df.apply(lambda row: (df * row > 0).sum(1), axis=1)
>>> co_occ
product_id  365    3333   9877   11202  32001  48750
product_id                                          
365             2      2      1      1      0      1
3333            2      3      1      1      1      2
9877            1      1      1      0      0      1
11202           1      1      0      1      0      0
32001           0      1      0      0      1      1
48750           1      2      1      0      1      2

The diagonal can be modified to the convention implied by the sample output (that a product co-occurs with itself if there is at least two of it in the same order) with np.fill_diagonal(co_occ.values, (df - 1).sum(1)).

Notation answered 15/11, 2020 at 3:15 Comment(0)

© 2022 - 2024 — McMap. All rights reserved.