Algorithm for automating pairwise significance grouping labels in R
Asked Answered
P

3

7

After struggling with this problem for a while, I am hoping to get some advice here. I am wondering if anyone is aware of an automated method for determining pairwise grouping labels based on significance. The question is independent of the significance test (e.g. Tukey for parametric or Mann-Whitney for non-parametric) - given these pairwise comparisons, some boxplot-type figures often represent these groupings with a sub-script:

enter image description here

I have done this example by hand, which can be quite tedious. I think that the sequence of labeling in the algorithm should be based on the number of levels in each group - e.g. those groups containing single levels that are significantly different from all other levels should be named first, then groups containing 2 levels, then 3, etc., all the while checking that new groupings add a new needed grouping and do not violate and differences.

In the example below, the tricky part is getting the algorithm to recognize that level 1 should be grouped with 3 and 5, but 3 and 5 should not be grouped (i.e. share a label).

Example code:

set.seed(1)
n <- 7
n2 <- 100
mu <- cumsum(runif(n, min=-3, max=3))
sigma <- runif(n, min=1, max=3)

dat <- vector(mode="list", n)
for(i in seq(dat)){
    dat[[i]] <- rnorm(n2, mean=mu[i], sd=sigma[i])
}

df <- data.frame(group=as.factor(rep(seq(n), each=n2)), y=unlist(dat))

bp <- boxplot(y ~ group, df, notch=TRUE)
kr <- kruskal.test(y ~ group, df)
kr
mw <- pairwise.wilcox.test(df$y, df$g)
mw
mw$p.value > 0.05 # TRUE means that the levels are not significantly different at the p=0.05 level

#      1     2     3     4     5     6
#2 FALSE    NA    NA    NA    NA    NA
#3  TRUE FALSE    NA    NA    NA    NA
#4 FALSE FALSE FALSE    NA    NA    NA
#5  TRUE FALSE FALSE FALSE    NA    NA
#6 FALSE FALSE FALSE  TRUE FALSE    NA
#7 FALSE FALSE FALSE FALSE FALSE FALSE

text(x=1:n, y=bp$stats[4,], labels=c("AB", "C", "A", "D", "B", "D", "E"), col=1, cex=1.5, pos=3, font=2)
Phonologist answered 15/5, 2014 at 14:48 Comment(10)
If I understand correctly, and you want the minimum number of labels, then this is the clique cover problem.Muddy
Yes, correct - the least number of labels possible. Do you happen to know of an implementation of this in R?Phonologist
How many populations can there be?Muddy
I'm not at all an R user.Muddy
@DavidEisenstat - I'm trying to find some general solution that I can use for various situations - not just the example here containing 7 levels.Phonologist
I understand. I'm trying to determine an appropriate algorithm to recommend, since there's an implementation complexity/time complexity tradeoff.Muddy
@DavidEisenstat - Lets say there is a maximum of 20Phonologist
Does your significance test yield a strict partial order? In other words, instead of just TRUE and FALSE, can you return labels {LESS, UNKNOWN, GREATER}, such that, if (a, b) is LESS, then (b, a) is GREATER, and if (a, b) is LESS, and (b, c) is LESS, then (a, c) is LESS also?Muddy
No - there is no need to have a particular order - just whether or not they belong together at a given cutoffPhonologist
FYI, the multcompLetters function in the multcompView package also does this.Malevolent
M
4

First let me restate the problem in the language of graph theory. Define a graph as follows. Each sample gives rise to a vertex that represents it. Between two vertices, there is an edge if and only if some test indicates that the samples represented by those vertices could not be distinguished statistically. In graph theory, a clique is a set of vertices such that, between every two vertices in the set, there is an edge. We're looking for a collection of cliques such that every edge in the graph belongs to (at least? exactly?) one of the cliques. We'd like to use as few cliques as possible. (This problem is called clique edge cover, not clique cover.) We then assign each clique its own letter and label its members with that letter. Each sample distinguishable from all others gets its own letter as well.

For example, the graph corresponding to your sample input could be drawn like this.

3---1---5       4--6

My proposed algorithm is the following. Construct the graph and use the Bron--Kerbosch algorithm to find all maximal cliques. For the graph above, these are {1, 3}, {1, 5}, and {4, 6}. The set {1}, for example, is a clique, but it is not maximal because it is a subset of the clique {1, 3}. The set {1, 3, 5} is not a clique because there is no edge between 3 and 5. In the graph

  1
 / \
3---5       4--6,

the maximal cliques would be {1, 3, 5} and {4, 6}.

Now search recursively for a small clique edge cover. The input to our recursive function is a set of edges remaining to be covered and the list of maximal cliques. Find the least edge in the remaining set, where, e.g., edge (1,2) < (1,5) < (2,3) < (2,5) < (3,4). For each maximal clique that contains this edge, construct a candidate solution comprised of that clique and the output of a recursive call where the clique edges are removed from set of edges remaining. Output the best candidate.

Unless there are very few edges, this may be too slow. The first performance improvement is memoize: maintain a map from inputs to outputs of the recursive function so that we can avoid doing the work twice. If that doesn't work, then R should have an interface to an integer program solver, and we can use integer programming to determine the best collection of cliques. (I'll explain this more if the other approach is insufficient.)

Muddy answered 15/5, 2014 at 16:21 Comment(1)
Thank you so much for this. You got me much of the way, and the other half came from this other question.Phonologist
P
2

I thought I would post the solution that I was able to derive with additional help from the following question:

set.seed(1)
n <- 7
n2 <- 100
mu <- cumsum(runif(n, min=-3, max=3))
sigma <- runif(n, min=1, max=3)

dat <- vector(mode="list", n)
for(i in seq(dat)){
    dat[[i]] <- rnorm(n2, mean=mu[i], sd=sigma[i])
}
df <- data.frame(group=as.factor(rep(seq(n), each=n2)), y=unlist(dat))
bp <- boxplot(y ~ group, df, notch=TRUE)


#significance test
kr <- kruskal.test(y ~ group, df)
mw <- pairwise.wilcox.test(df$y, df$g)

#matrix showing connections between levels
g <- as.matrix(mw$p.value > 0.05)
g <- cbind(rbind(NA, g), NA)
g <- replace(g, is.na(g), FALSE)
g <- g + t(g)
diag(g) <- 1
rownames(g) <- 1:n
colnames(g) <- 1:n
g

#install.packages("igraph")
library(igraph)

# Load data
same <- which(g==1)
topology <- data.frame(N1=((same-1) %% n) + 1, N2=((same-1) %/% n) + 1)
topology <- topology[order(topology[[1]]),] # Get rid of loops and ensure right naming of vertices
g3 <- simplify(graph.data.frame(topology,directed = FALSE))
get.data.frame(g3)

# Plot graph
plot(g3)

# Calcuate the maximal cliques
res <- maximal.cliques(g3)

# Reorder given the smallest level
res <- sapply(res, sort)
res <- res[order(sapply(res,function(x)paste0(sort(x),collapse=".")))]

ml<-max(sapply(res, length))
reord<-do.call(order, data.frame(
    do.call(rbind, 
        lapply(res, function(x) c(sort(x), rep.int(0, ml-length(x))))
    )
))
res <- res[reord]

lab.txt <- vector(mode="list", n)
lab <- letters[seq(res)]
for(i in seq(res)){
    for(j in res[[i]]){
        lab.txt[[j]] <- paste0(lab.txt[[j]], lab[i])
    }
}

bp <- boxplot(y ~ group, df, notch=TRUE, outline=FALSE, ylim=range(df$y)+c(0,1))
text(x=1:n, y=bp$stats[5,], labels=lab.txt, col=1, cex=1, pos=3, font=2)

enter image description here

Phonologist answered 19/5, 2014 at 4:29 Comment(3)
Thanks so much for this! I'm using R 3.0.1, and I had to remove the line "res <- sapply(res, sort)" to get it to work, but after that it worked great.Malevolent
with the current igraph version reord <- ... seems not to work: Error in parse_op_args(..., what = "a vertex", is_fun = is_igraph_vs, : Not a vertex sequenceTurgite
Looks like you can update it to the new version by changing the part inside the second do.call to lapply(res, function(x) c(sort(as.vector(x)), rep.int(0, ml-length(x)))). In the 2015 update, igraph took out the ability to automatically treat vertices as vectors (see: igraph.org/2015/06/24/igraph-1.0.0-r.html). That said, I'm not 100% sure it's not redundant with the previous reordering step.Malevolent
U
0

Cool code.

I think you need to quote the function order() when calling do.call:

reord<-do.call("order", data.frame(
do.call(rbind, 
    lapply(res, function(x) c(sort(x), rep.int(0, ml-length(x))))
)
))
Upolu answered 24/10, 2014 at 14:25 Comment(0)

© 2022 - 2024 — McMap. All rights reserved.