Venn diagram from list of clusters and co-occurring factors

set.seed(1) x = paste("cluster-",sample(c(1:100),500,replace=TRUE),sep="") y = c( paste("factor-",sample(c(letters[1:3]),300, replace=TRUE),sep=""), paste("factor-",sample(c(letters[1]),100, replace=TRUE),sep=""), paste("factor-",sample(c(letters[2]),50, replace=TRUE),sep=""), paste("factor-",sample(c(letters[3]),50, replace=TRUE),sep="") ) data = data.frame(cluster=x,factor=y)

I've provided two solutions, using two different packages with Venn diagram capabilities. As you expected, both involve an initial step using the aggregate() function.

I tend to prefer the results from the venneuler package. It's default label positions aren't ideal, but you could adjust them by having a look at the associated plot method (possibly using locator() to select the coordinates).

Solution the 1st:

One possibility is to use venneuler() in the venneuler package to draw your Venn diagram.

library(venneuler)

## Modify the "factor" column, by renaming it and converting
## it to a character vector.
levels(data$factor) <- c("a", "b", "c")
data$factor <- as.character(data$factor)

## FUN is an anonymous function that determines which letters are present
## 2 or more times in the cluster and then pastes them together into 
## strings of a form that venneuler() expects.
##
inter <- aggregate(factor ~ cluster, data=data,
                   FUN = function(X) {
                       tab <- table(X)
                       names <- names(tab[tab>=2])
                       paste(sort(names), collapse="&")
                   })            
## Count how many clusters contain each combination of letters
counts <- table(inter$factor)
counts <- counts[names(counts)!=""]  # To remove groups with <2 of any letter
#  a   a&b a&b&c   a&c     b   b&c     c 
# 19    13    12    14    13     9    12 

## Convert to proportions for venneuler()
ps <- counts/sum(counts)

## Calculate the Venn diagram
vd <- venneuler(c(a=ps[["a"]], b = ps[["b"]], c = ps[["c"]],
                  "a&b" = ps[["a&b"]],
                  "a&c" = ps[["a&c"]],
                  "b&c" = ps[["b&c"]],
                  "a&b&c" = ps[["a&b&c"]]))
## Plot it!
plot(vd)

A few notes about choices I made in writing this code:

I've changed the names of factors from "factor-a" to "a". You can obviously change that back.
I've only required each factor to be present >=2 times (instead of >10) to be counted within each cluster. (That was to demonstrate the code with this small subset of your data.)
If you take a look at the intermediate object counts, you'll see that it contains an initial unnamed element. That element is the number of clusters that contain fewer than 2 of any letter. You can decide better than I whether or not you want to include those in the calculation of the subsequent ps ('proportions') object.

enter image description here

Solution the 2nd:

Another possibility is to employ vennCounts() and vennDiagram() in the Bioconductor package limma. To download the package, follow the instructions here. Unlike the venneuler solution above, the overlap in the resultant diagram is not proportional to the actual degree of intersection. Instead, it annotates the diagram with the actual frequencies. (Note that this solution does not involve any edits to the data$factor column.)

library(limma)

out <- aggregate(factor ~ cluster, data=data, FUN=table)
out <- cbind(out[1], data.frame(out[2][[1]]))

counts <- vennCounts(out[, -1] >= 2)
vennDiagram(counts, names = c("Factor A", "Factor B", "Factor C"),
            cex = 1, counts.col = "red")

enter image description here

Recommended topics

Hot tags