Parallel wilcox.test using group_by and summarise

There must be an R-ly way to call wilcox.test over multiple observations in parallel using group_by. I've spent a good deal of time reading up on this but still can't figure out a call to wilcox.test that does the job. Example data and code below, using magrittr pipes and summarize().

library(dplyr)
library(magrittr)

# create a data frame where x is the dependent variable, id1 is a category variable (here with five levels), and id2 is a binary category variable used for the two-sample wilcoxon test
df <- data.frame(x=abs(rnorm(50)),id1=rep(1:5,10), id2=rep(1:2,25))

# make sure piping and grouping are called correctly, with "sum" function as a well-behaving example function 
df %>% group_by(id1) %>% summarise(s=sum(x))
df %>% group_by(id1,id2) %>% summarise(s=sum(x))

# make sure wilcox.test is called correctly 
wilcox.test(x~id2, data=df, paired=FALSE)$p.value

# yet, cannot call wilcox.test within pipe with summarise (regardless of group_by). Expected output is five p-values (one for each level of id1)
df %>% group_by(id1) %>% summarise(w=wilcox.test(x~id2, data=., paired=FALSE)$p.value) 
df %>% summarise(wilcox.test(x~id2, data=., paired=FALSE))

# even specifying formula argument by name doesn't help
df %>% group_by(id1) %>% summarise(w=wilcox.test(formula=x~id2, data=., paired=FALSE)$p.value)

The buggy calls yield this error:

Error in wilcox.test.formula(c(1.09057358373486, 
    2.28465932554436, 0.885617572657959,  : 'formula' missing or incorrect

Thanks for your help; I hope it will be helpful to others with similar questions as well.

Perlis answered 3/1, 2016 at 20:39 Comment(2)
The other answers are more complete, but just for the sake of listing all possible solutions: df %>% group_by(id1) %>% summarise(w=wilcox.test(x[id2==1], x[id2==2], paired=FALSE)$p.value) - Detergent
@Detergent your solution works best for me, because in my case id1 is non-numeric and your solution still works. I first tried using do() as shown elsewhere on this page, and I got an error. - Leet

You can do this with base R (although the result is a cumbersome list):

# d is the subset of df for one level of id1 (named d so it is not confused with column x)
by(df, df$id1, function(d) { wilcox.test(x~id2, data=d, paired=FALSE)$p.value })

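or with plyr (ddply() belongs to plyr, not dplyr):

library(plyr)  # provides ddply() and .(); load it before dplyr if you use both
ddply(df, .(id1), function(d) { wilcox.test(x~id2, data=d, paired=FALSE)$p.value })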

  id1        V1
1   1 0.3095238
2   2 1.0000000
3   3 0.8412698
4   4 0.6904762
5   5 0.3095238
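
If the list-like result of by() is inconvenient, one possible way to flatten it in base R is split() plus sapply(). This is only a sketch: the seed and the names p_by_group and res are illustrative assumptions, not part of the original answer.

```r
# Illustrative data with the same shape as in the question; the seed is arbitrary
set.seed(42)
df <- data.frame(x = abs(rnorm(50)), id1 = rep(1:5, 10), id2 = rep(1:2, 25))

# split() yields one data frame per level of id1; sapply() collects one p-value each
p_by_group <- sapply(
  split(df, df$id1),
  function(d) wilcox.test(x ~ id2, data = d, paired = FALSE)$p.value
)

# Turn the named vector into a plain two-column data frame
res <- data.frame(id1 = names(p_by_group), p.value = unname(p_by_group))
print(res)
```

Note that id1 comes back as character here, because it is taken from the names of the split list.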
Beset answered 3/1, 2016 at 20:48 Comment(0)

Your task will be easily accomplished using the do function (call ?do after loading the dplyr library). Using your data, the chain will look like this:

df <- data.frame(x=abs(rnorm(50)),id1=rep(1:5,10), id2=rep(1:2,25))
df <- as_tibble(df)  # tbl_df() is deprecated in current dplyr; as_tibble() replaces it
res <- df %>% group_by(id1) %>% 
       do(w = wilcox.test(x~id2, data=., paired=FALSE)) %>% 
       summarise(id1, Wilcox = w$p.value)

output

res
Source: local data frame [5 x 2]

    id1    Wilcox
  (int)     (dbl)
1     1 0.6904762
2     2 0.4206349
3     3 1.0000000
4     4 0.6904762
5     5 1.0000000

Note that I added do() between group_by() and summarise().
I hope it helps.
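
For readers on current dplyr, where do() is marked as superseded, here is a minimal sketch of the same per-group test. It first shows that `.` inside summarise() is the whole piped-in data frame rather than the current group (a plausible reason the calls in the question could not produce per-group p-values), then uses group_modify(), which passes each group's rows to a function. The seed and the names chk, res, d and key are illustrative assumptions, and dplyr 0.8.1 or later is assumed for group_modify().

```r
library(dplyr)

# Illustrative data with the same shape as in the question; the seed is arbitrary
set.seed(42)
df <- data.frame(x = abs(rnorm(50)), id1 = rep(1:5, 10), id2 = rep(1:2, 25))

# `.` is bound by the pipe to the full data frame, not to the current group,
# so nrow(.) reports all rows while n() reports the group size
chk <- df %>%
  group_by(id1) %>%
  summarise(n_dot = nrow(.), n_grp = n())

# group_modify() calls the function once per group:
# d holds that group's rows, key holds the grouping values
res <- df %>%
  group_by(id1) %>%
  group_modify(function(d, key) {
    data.frame(Wilcox = wilcox.test(x ~ id2, data = d, paired = FALSE)$p.value)
  }) %>%
  ungroup()

print(chk)
print(res)
```

The result has one row per level of id1, in the same layout as the do() output above.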

Interpolate answered 3/1, 2016 at 21:3 Comment(1)
Excellent answer using group_by and pipes, which were part of the original question. I selected the response from @patrickmdnet as the official answer since its elegant dplyr method worked "out of the box" for my more complex real-world data frame which threw some yet unknown wrench into the group_by/do piped method listed here. - Perlis
