Count common sets of items between different customers
I have data on customers and the different products they have purchased:

Customer    Product
   1           A
   1           B
   1           C
   2           D
   2           E
   2           F
   3           A
   3           B
   3           D
   4           A
   4           B

I would like to check which sets of products occur together across different customers, and to get a count for product combinations of different lengths. For example, the combination A and B together occurs in three different customers; the product group A, B and C occurs in one customer. And so on for all sets of two or more products in the data. Something like:

Product Group    Number
A, B, C             1
D, E, F             1
A, B, D             1
A, B                3

Thus, I'm counting the A, B combination both in customers who have only products A and B (e.g. customer 4) and in customers who have A and B plus any other products (e.g. customer 1, who has A, B and C).

Does anyone have any ideas on how to do this with either a tidyverse or base R approach? I feel like it ought to be pretty trivial - maybe pivot_wider first, then count?

I have found this question and answer that does what I need for pairs of products, but I also need to count combinations of more than two products.

Welton answered 9/8, 2020 at 7:51 Comment(1)
Here's a related question: #60038173 - Moneymaker

If you can use a non-base package, there is a tool dedicated to the task of finding item sets: arules::apriori. It is much faster on larger data sets.

library(arules)

# coerce the data frame to a binary incidence matrix,
# then use apriori to mine "frequent itemsets"
r = apriori(data = as.matrix(table(dat) > 0),
            # `par` partially matches the `parameter` argument: type of
            # association mined, minimal support needed for an item set,
            # minimal number of items per item set
            par = list(target = "frequent itemsets",
                       support = 0,
                       minlen = 2))

# coerce itemset to data.frame, select relevant rows and columns 
d = as(r, "data.frame")
d[d$count > 0, c("items", "count")]

#      items count
# 4    {B,C}     1
# 5    {A,C}     1
# 6    {E,F}     1
# 7    {D,E}     1
# 10   {D,F}     1
# 13   {B,D}     1
# 14   {A,D}     1
# 15   {A,B}     3
# 25 {A,B,C}     1
# 26 {D,E,F}     1
# 35 {A,B,D}     1
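
For reference, this is what the incidence matrix passed to apriori looks like for the toy data (rows are customers, columns are products):

as.matrix(table(dat) > 0)
#         Product
# Customer     A     B     C     D     E     F
#        1  TRUE  TRUE  TRUE FALSE FALSE FALSE
#        2 FALSE FALSE FALSE  TRUE  TRUE  TRUE
#        3  TRUE  TRUE FALSE  TRUE FALSE FALSE
#        4  TRUE  TRUE FALSE FALSE FALSE FALSE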

Timing on a larger data set: 10000 customers with up to 6 products each. apriori is quite a lot faster.

# Unit: milliseconds
#              expr        min        lq       mean     median         uq        max neval
#     f_henrik(dat)   38.95475   39.8621   41.44454   40.67313   41.05565   57.64655    20
#      f_allan(dat) 4578.20595 4622.2363 4664.57187 4654.58713 4679.78119 4924.22537    20
#        f_jay(dat) 2799.10516 2939.9727 2995.90038 2971.24127 2999.82019 3444.70819    20
#     f_uwe_dt(dat) 2943.26219 3007.1212 3028.37550 3027.46511 3060.38380 3076.25664    20
#  f_uwe_dplyr(dat) 6339.03141 6375.7727 6478.77979 6448.56399 6521.54196 6816.09911    20

10000 customers with up to 10 products each. apriori is several hundred times faster.

# Unit: milliseconds
#             expr         min          lq        mean      median          uq         max neval
#    f_henrik(dat)    58.40093    58.95241    59.71129    59.63988    60.43591    61.21082    20
#       f_jay(dat) 52824.67760 53369.78899 53760.43652 53555.69881 54049.91600 55605.47980    20
#    f_uwe_dt(dat) 22612.87954 22820.12012 22998.85072 22974.32710 23220.00390 23337.22815    20
# f_uwe_dplyr(dat) 26083.20240 26255.88861 26445.49295 26402.67887 26659.81195 27046.83491    20

On the larger data set, Allan's code gave warnings (In rawToBits(as.raw(x)) : out-of-range values treated as 0 in coercion to raw), which seemed to affect the result. Thus, it is not included in the second benchmark.
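
The warning is easy to reproduce: a raw value holds a single byte (0-255), so the subset masks in the helper overflow as soon as a customer has more than 8 products (2^9 = 512 exceeds the range of a byte). A minimal illustration:

as.raw(300)
# [1] 00
# Warning message:
# out-of-range values treated as 0 in coercion to raw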


Data and benchmark code:

set.seed(3)
n_cust = 10000
n_product = sample(2:6, n_cust, replace = TRUE)  # 2:10 in the second run
dat = data.frame(
  Customer = rep(1:n_cust, n_product),
  Product = unlist(lapply(n_product, function(n) sample(letters[1:6], n))))  # letters[1:10] in the second run

library(microbenchmark)
res = microbenchmark(f_henrik(dat),
                     f_allan(dat),
                     f_jay(dat),
                     f_uwe_dt(dat),
                     f_uwe_dplyr(dat),
                     times = 20L)

Check for equality:

henrik = f_henrik(dat)
allan = f_allan(dat)
jay = f_jay(dat)
uwe_dt = f_uwe_dt(dat)
uwe_dplyr = f_uwe_dplyr(dat)

# change outputs to common format for comparison
# e.g. string format, column names, order
henrik$items = substr(henrik$items, 2, nchar(henrik$items) - 1)
henrik$items = gsub(",", ", ", henrik$items)

l = list(
  henrik = henrik, allan = allan, jay = jay, uwe_dt = uwe_dt, uwe_dplyr = uwe_dplyr)
l = lapply(l, function(d){
  d = setNames(as.data.frame(d), c("items", "count"))
  d = d[order(d$items), ]
  row.names(d) = NULL
  d
})

all.equal(l[["henrik"]], l[["allan"]])
# TRUE
all.equal(l[["henrik"]], l[["jay"]])
# TRUE
all.equal(l[["henrik"]], l[["uwe_dt"]])
# TRUE
all.equal(l[["henrik"]], l[["uwe_dplyr"]])
# TRUE

Functions:

f_henrik = function(dat){
  r = apriori(data = as.matrix(table(dat) > 0),
              par = list(target = "frequent itemsets",
                         support = 0,
                         minlen = 2))
  d = as(r, "data.frame")
  d[d$count > 0, c("items", "count")]
}

f_allan = function(dat){
  all_multiples <- function(strings)
  {
    n <- length(strings)
    do.call("c", sapply(1:2^n, function(x) {
      mystrings <- strings[as.character(rawToBits(as.raw(x))[seq(n)]) == "01"]
      if (length(mystrings) > 1) paste(mystrings, collapse = ", ") else NULL
    }))
  }
  dat %>% 
    group_by(Customer) %>% 
    arrange(Product) %>%
    summarize(Product_group = all_multiples(Product)) %>%
    group_by(Product_group) %>%
    count(Product_group)
}

f_jay = function(dat){
  a <- split(dat$Product, dat$Customer)  ## thx to @Henrik
  r <- range(lengths(a))
  pr <- unlist(lapply(r[1]:r[2], function(x) 
    combn(unique(dat$Product), x, list)), recursive=F)
  or <- rowSums(outer(pr, a, Vectorize(function(x, y) all(x %in% y))))
  res <- data.frame(p.group=sapply(pr, toString), number=or)
  res[res$number > 0, ]
}


f_uwe_dt = function(dat){
  setorder(setDT(dat), Customer, Product)
  dat[, .(Product.Group = unlist(lapply(tail(seq(.N), -1L), 
                                        function(m) combn(unique(Product), m, toString, FALSE)))), 
      by = Customer][
        , .N, by = Product.Group]
}

f_uwe_dplyr = function(dat){
  dat %>% 
    arrange(Customer, Product) %>% 
    group_by(Customer) %>% 
    summarise(Product.Group = n() %>% 
                seq() %>% 
                tail(-1L) %>% 
                lapply(function(m) combn(unique(Product), m, toString, FALSE)) %>% 
                unlist()) %>%
    ungroup() %>% 
    count(Product.Group)
}
Galvin answered 10/8, 2020 at 22:35 Comment(5)
Wow, that's very thorough - thank you. And despite using R for 15 years (you wouldn't know given my stupid question!) I had never heard of arules. Just had a quick look; it seems fantastic, with lots of features for this sort of thing and more. It's funny how, when something is commonly used outside your field and you don't know its terminology, it's so easy to be completely unaware of it even after an internet search. - Welton
@Welton Thank you for your feedback. It was certainly not a stupid question (otherwise I wouldn't have spent my time answering it ;) Indeed, the apriori function is spot on for problems like yours - no need to reinvent the wheel. And the speed gain is quite considerable. If you haven't found it already, you may have a look at the nice vignette. - Galvin
Ok, WOW. I just got round to implementing this (I'm actually on holiday but had a little time to try it) and excuse the blasphemy but, Jesus, that's fast. Everything else ran for ages then crashed my laptop, whereas your solution finished in about a second. I am so impressed and thankful again. I'm going to be mean and switch the accepted answer to this one (sorry Allan - I do like the tidy syntax though). - Welton
I'm delighted to hear that I (or rather apriori) managed to trigger some blasphemy. - Galvin
No problem @Welton - you should pick the best answer for your problem, and I think Henrik has demonstrated that his answer is the most effective here. Folks often say they're looking for a tidyverse solution because they like the syntax, but it's rarely the fastest solution to a problem if speed is important. - Divider

If you define a little helper function that gets all groupings of more than one item:

all_multiples <- function(strings)
{
  n <- length(strings)
  # treat each integer as a bitmask over the strings to enumerate all
  # subsets; note as.raw() holds a single byte, so this fails for
  # customers with more than 8 products (see the comments below)
  do.call("c", sapply(1:2^n, function(x) {
    mystrings <- strings[as.character(rawToBits(as.raw(x))[seq(n)]) == "01"]
    if (length(mystrings) > 1) paste(mystrings, collapse = ", ") else NULL
  }))
}

then you can do this nicely in a tidyverse pipe:

dat %>% 
  group_by(Customer) %>% 
  arrange(Product) %>%
  summarize(Product_group = all_multiples(Product)) %>%
  group_by(Product_group) %>%
  count(Product_group)
#> # A tibble: 11 x 2
#> # Groups:   Product_group [11]
#>    Product_group     n
#>    <chr>         <int>
#>  1 A, B              3
#>  2 A, B, C           1
#>  3 A, B, D           1
#>  4 A, C              1
#>  5 A, D              1
#>  6 B, C              1
#>  7 B, D              1
#>  8 D, E              1
#>  9 D, E, F           1
#> 10 D, F              1
#> 11 E, F              1
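
For intuition, the helper treats each integer from 1 to 2^n as a bitmask over the n strings: rawToBits() exposes the bits, and the "01" bits select which strings belong to the subset. A minimal illustration (toy values, not from the pipe above):

strings <- c("A", "B", "C")
x <- 5  # binary 101: selects the 1st and 3rd string
rawToBits(as.raw(x))[seq_along(strings)]
#> [1] 01 00 01
strings[as.character(rawToBits(as.raw(x))[seq_along(strings)]) == "01"]
#> [1] "A" "C"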
Divider answered 9/8, 2020 at 9:38 Comment(2)
I love you! Thanks so much. I still haven't quite got my head around using functions within summarise as well as I should have, so this answers my question and helps more generally with that. Thanks again. - Welton
Hi again @AllanCameron. I was checking my code again against the answers so far. This time I got a warning on your code (Warning message: In rawToBits(as.raw(x)) : out-of-range values treated as 0 in coercion to raw). The output then no longer agrees. I tried to reduce the data which generates the warning and ended up with this: set.seed(3); n_cust = 5; n_product = sample(2:8, n_cust, replace = TRUE); dat = data.frame(Customer = rep(1:n_cust, n_product), Product = unlist(lapply(n_product, function(n) sample(letters[1:8], n)))). Sorry for bothering you ;) - Galvin

You could split the data by customer, then get all combinations of product pairs and triples using combn. Then find matches using %in% with outer, build a data frame by collapsing the products with toString, and finally discard rows whose count is zero.

# a <- aggregate(Product ~ Customer, dat, I)$Product  ## old solution
# if (is.matrix(a)) a <- as.data.frame(t(a))  ## old solution
a <- split(dat$Product, dat$Customer)  ## thx to @Henrik
r <- range(lengths(a))
pr <- unlist(lapply(r[1]:r[2], function(x) 
  combn(unique(dat$Product), x, list)), recursive=F)
or <- rowSums(outer(pr, a, Vectorize(function(x, y) all(x %in% y))))
res <- data.frame(p.group=sapply(pr, toString), number=or)
res[res$number > 0, ]
#    p.group number
# 1     A, B      3
# 2     A, C      1
# 3     A, D      1
# 6     B, C      1
# 7     B, D      1
# 13    D, E      1
# 14    D, F      1
# 15    E, F      1
# 16 A, B, C      1
# 17 A, B, D      1
# 35 D, E, F      1
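
To see what the matching step does, here is a stripped-down illustration with two made-up product sets and two baskets (hypothetical values, not the data above): outer() builds a logical "set x is contained in basket y" matrix, and rowSums() counts the customers per set.

sets    <- list(c("A", "B"), c("A", "B", "C"))
baskets <- list(c("A", "B", "C"), c("A", "B"))
m <- outer(sets, baskets, Vectorize(function(x, y) all(x %in% y)))
m
#      [,1]  [,2]
# [1,] TRUE  TRUE
# [2,] TRUE FALSE
rowSums(m)
# [1] 2 1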

Data

dat <- read.table(header=TRUE, text="Customer    Product
1           A
1           B
1           C
2           D
2           E
2           F
3           A
3           B
3           D
4           A
4           B")
Wacker answered 9/8, 2020 at 8:11 Comment(0)
F
2

For the sake of completeness, here is a solution in data.table syntax which can be translated to dplyr syntax as well.

For both implementations, the core idea is the same:

  1. Sort by Product (an important step which has been neglected by the other answers posted so far).
  2. For each Customer, create the product groups by using combn() with varying lengths m. Product.Group is a kind of natural key created by concatenating the included products with the toString() function (see the short illustration after this list).
     Here we can see why sorting Product is important: products B, A as well as A, B should appear in the same product group A, B.
  3. Finally, count the number of occurrences by Product.Group.
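
A short illustration of step 2 for one customer: with toString() as FUN and simplify = FALSE, combn() directly returns the comma-separated group labels.

combn(c("A", "B", "C"), 2, toString, FALSE)
# [[1]]
# [1] "A, B"
#
# [[2]]
# [1] "A, C"
#
# [[3]]
# [1] "B, C"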

data.table version

library(data.table)
setorder(setDT(df), Customer, Product)
df[, .(Product.Group = unlist(lapply(tail(seq(.N), -1L), 
                              function(m) combn(unique(Product), m, toString, FALSE)))), 
   by = Customer][
     , .N, by = Product.Group]
    Product.Group N
 1:          A, B 3
 2:          A, C 1
 3:          B, C 1
 4:       A, B, C 1
 5:          D, E 1
 6:          D, F 1
 7:          E, F 1
 8:       D, E, F 1
 9:          A, D 1
10:          B, D 1
11:       A, B, D 1

dplyr version

library(dplyr)
df %>% 
  arrange(Customer, Product) %>% 
  group_by(Customer) %>% 
  summarise(Product.Group = n() %>% 
              seq() %>% 
              tail(-1L) %>% 
              lapply(function(m) combn(unique(Product), m, toString, FALSE)) %>% 
              unlist()) %>%
  ungroup() %>% 
  count(Product.Group)
   Product.Group     n
   <chr>         <int>
 1 A, B              3
 2 A, B, C           1
 3 A, B, D           1
 4 A, C              1
 5 A, D              1
 6 B, C              1
 7 B, D              1
 8 D, E              1
 9 D, E, F           1
10 D, F              1
11 E, F              1

Data

library(data.table)
df <- fread("
      Customer    Product
   1           A
   1           B
   1           C
   2           D
   2           E
   2           F
   3           A
   3           B
   3           D
   4           A
   4           B")
Fiveandten answered 10/8, 2020 at 6:54 Comment(0)
