Association analysis with duplicate transactions using arules package in R

3

8

I want to create a transactions object in basket format that I can reuse across my analyses. The data consist of 1001 transactions, each a line of comma-separated items. The first 10 transactions look like this:

hering,corned_b,olives,ham,turkey,bourbon,ice_crea
baguette,soda,hering,cracker,heineken,olives,corned_b
avocado,cracker,artichok,heineken,ham,turkey,sardines
olives,bourbon,coke,turkey,ice_crea,ham,peppers
hering,corned_b,apples,olives,steak,avocado,turkey
sardines,heineken,chicken,coke,ice_crea,peppers,ham
olives,bourbon,coke,turkey,ice_crea,heineken,apples
corned_b,peppers,bourbon,cracker,chicken,ice_crea,baguette
soda,olives,bourbon,cracker,heineken,peppers,baguette
corned_b,peppers,bourbon,cracker,chicken,bordeaux,hering
...

I noticed that there are duplicated transactions in the data and removed them, but each time I try to read the transactions I get:

Error in asMethod(object) : can not coerce list with transactions with duplicated items

Here is my code:

data <- read.csv("AssociationsItemList.txt",header=F)
data <-  data[!duplicated(data),]
pop <- NULL
for(i in 1:length(data)){
pop <- paste(pop, data[i],sep="\n")
}
write(pop, file = "Trans", sep = ",")
transdata <- read.transactions("Trans", format = "basket", sep=",")

I'm sure there's something little yet important I've missed. Kindly offer your assistance.

Cityscape answered 17/6, 2013 at 14:9 Comment(2)
Sorry, it looks like you're writing the file as a CSV (or something close). Have you tried read.csv or read.table at the end? -- Picky
How was the above transaction file created without header columns? -- Belvia
18

The problem is not with duplicated transactions (the same row appearing twice) but duplicated items (the same item appearing twice, in the same transaction -- e.g., "olives" on line 4).

read.transactions has an rm.duplicates argument to remove those duplicates.

read.transactions("Trans", format = "basket", sep=",", rm.duplicates=TRUE)
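To see which baskets contain a repeated item before reading the file, a minimal base R sketch along these lines may help. It assumes one comma-separated transaction per line; the helper name `find_dup_items` and the two sample baskets are made up for illustration and are not part of arules.

```r
# Hypothetical helper: report the duplicated items in each basket line.
find_dup_items <- function(lines, sep = ",") {
  baskets <- strsplit(lines, sep, fixed = TRUE)
  dups <- lapply(baskets, function(items) {
    items <- trimws(items)
    unique(items[duplicated(items)])
  })
  names(dups) <- seq_along(dups)   # name each entry by its line number
  dups[lengths(dups) > 0]          # keep only baskets that repeat an item
}

# Made-up sample; for the real file, readLines("Trans") could be used instead.
sample_lines <- c("olives,ham,turkey", "olives,bourbon,olives,coke")
dup_report <- find_dup_items(sample_lines)
print(dup_report)
```

Each entry of the result is named after the line number of the offending basket, so the lines can be inspected or fixed by hand if dropping duplicates silently is not desired.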
Fitzsimmons answered 17/6, 2013 at 15:21 Comment(2)
Would you mind explaining why duplicate items in the same transaction are not allowed? What if, for example, you wanted to show that you bought double the normal amount of olives? My transaction data, for example, has a quantity column, and I wasn't sure how to account for that. -- Sommers
I think this kind of information is not what APRIORI handles. -- Blockish
2

Vincent Zoonekynd is right: the problem is caused by duplicated items within a transaction. Here I can explain why arules requires transactions without duplicated items.

  • The transaction data is stored internally as an ngCMatrix object. Relevant source code:

    setClass("itemMatrix",
      representation(
        data        = "ngCMatrix", 
    ...
    setClass("transactions",
      contains = "itemMatrix",
    ...
    
  • ngCMatrix is a sparse matrix class defined in the Matrix package. Its description from the official documentation:

    The nsparseMatrix class is a virtual class of sparse “pattern” matrices, i.e., binary matrices conceptually with TRUE/FALSE entries. Only the positions of the elements that are TRUE are stored

It seems ngCMatrix stores the status of each element as a binary indicator. This means a transactions object in arules can only record whether an item is present or absent in a transaction; it cannot record quantity. So...
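The loss of quantity information can be illustrated with the Matrix package alone. The sketch below builds a small numeric sparse matrix of made-up quantities (the item and transaction names, and the items-by-transactions layout, are assumptions for illustration) and coerces it to a pattern matrix of the kind the itemMatrix class uses.

```r
library(Matrix)

# Made-up quantities: rows are items, columns are transactions.
qty <- sparseMatrix(
  i = c(1, 2, 2), j = c(1, 1, 2), x = c(2, 1, 3),
  dims = c(3, 2),
  dimnames = list(c("olives", "ham", "turkey"), c("t1", "t2"))
)

# Coerce to a pattern ("n") matrix: only the positions of entries are kept.
pat <- as(qty, "nMatrix")

print(class(pat))
print(as.matrix(pat))             # logical matrix; the 2 and 3 are gone
print("x" %in% slotNames(pat))    # no slot left to hold the values
```

After the coercion, two olives in `t1` and one ham in `t1` look identical, which is consistent with why a repeated item in a basket has nothing to map to.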

Botha answered 3/2, 2017 at 13:8 Comment(0)
0

I just used the 'unique' function to remove duplicates. My data was a little different: I had a data frame (the data was too large for a CSV) with 2 columns, product_id and trans_id. I know it's not your specific question, but I had to do this to create the transaction dataset and apply association rules.

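library(arules)
# data: a data frame with more than 1 million transactions (columns: product_id, trans_id)
data <- unique(data[, 1:2])  # drop repeated (product_id, trans_id) pairs
# one item vector per transaction, coerced to a transactions object
trans <- as(split(data[, "product_id"], data[, "trans_id"]), "transactions")
rules <- apriori(trans, parameter = list(supp = 0.001, conf = 0.2))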
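What the `unique` and `split` steps do can be checked on a toy data frame with base R only. The data below is made up; the repeated (olives, t1) row stands in for a purchase with quantity 2.

```r
# Made-up long-format data with one repeated (product_id, trans_id) pair.
toy <- data.frame(
  product_id = c("olives", "olives", "ham", "ham"),
  trans_id   = c("t1", "t1", "t1", "t2"),
  stringsAsFactors = FALSE
)

toy <- unique(toy[, 1:2])   # drops the second (olives, t1) row
baskets <- split(toy[, "product_id"], toy[, "trans_id"])
print(baskets)

# With arules loaded, as(baskets, "transactions") is the coercion used above;
# it is left out here so the sketch runs without the package.
```

Because each basket now lists an item at most once, the list should no longer trip the duplicated-items check, at the cost of discarding the quantity.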
Hypnogenesis answered 7/7, 2017 at 17:48 Comment(0)
