I have an input file listing ~50,000 clusters and the factors present in each of them (~10 million entries in total). A smaller example:
set.seed(1)
x = paste("cluster-",sample(c(1:100),500,replace=TRUE),sep="")
y = c(
paste("factor-",sample(c(letters[1:3]),300, replace=TRUE),sep=""),
paste("factor-",sample(c(letters[1]),100, replace=TRUE),sep=""),
paste("factor-",sample(c(letters[2]),50, replace=TRUE),sep=""),
paste("factor-",sample(c(letters[3]),50, replace=TRUE),sep="")
)
data = data.frame(cluster=x,factor=y)
With a bit of help from another question, I got it to produce a pie chart of the co-occurrence of factors like this:
counts = with(data, table(tapply(factor, cluster, function(x) paste(as.character(sort(unique(x))), collapse='+'))))
pie(counts[counts>1])
But now I would like to have a Venn diagram for the co-occurrence of factors, ideally in a way that can take a threshold for the minimum count of each factor. For example, a Venn diagram in which a factor is only taken into account for a cluster if it occurs more than 10 times (n>10) in that cluster.
I've tried to find a way to produce the table counts with aggregate, but couldn't make it work.
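To make the thresholded version of the counts concrete, here is a minimal sketch of what I am after. It uses table() rather than aggregate, and the helper name combo_counts and the parameter min_count are made up for illustration. It rebuilds the example data from above so that it runs on its own, and it assumes that "present" means a factor occurs at least min_count times in a cluster.

```r
# Example data, same construction as in the question.
set.seed(1)
x <- paste("cluster-", sample(c(1:100), 500, replace = TRUE), sep = "")
y <- c(
  paste("factor-", sample(c(letters[1:3]), 300, replace = TRUE), sep = ""),
  paste("factor-", sample(c(letters[1]), 100, replace = TRUE), sep = ""),
  paste("factor-", sample(c(letters[2]), 50, replace = TRUE), sep = ""),
  paste("factor-", sample(c(letters[3]), 50, replace = TRUE), sep = "")
)
data <- data.frame(cluster = x, factor = y)

# Hypothetical helper: count clusters per combination of factors, where a
# factor only counts for a cluster if it occurs at least min_count times.
combo_counts <- function(data, min_count = 1) {
  # Cluster-by-factor matrix of occurrence counts.
  tab <- unclass(table(data$cluster, data$factor))
  # Logical matrix: TRUE where the factor meets the threshold in the cluster.
  present <- tab >= min_count
  # Label each cluster with the '+'-joined names of its qualifying factors.
  combo <- apply(present, 1, function(p) {
    paste(colnames(present)[p], collapse = "+")
  })
  # Drop clusters in which no factor meets the threshold, then tabulate.
  table(combo[combo != ""])
}

# Clusters per factor combination, requiring at least 2 occurrences per factor.
counts_thresholded <- combo_counts(data, min_count = 2)
print(counts_thresholded)
```

With min_count = 1 this should give the same table as counts above. Each entry is the number of clusters that have exactly that combination of factors, which corresponds to one region of the Venn diagram.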
venneuler library, or this brief article in the Journal of Stat Software using the venn library (Murdoch, 2004). If this is purely about R programming it should be migrated to SO. – Insalivatey
are independent of the seed, so is it extraneous to use sample() to produce them? You could use rep instead, or were these supposed to be random? – Fungous