Why are my dplyr group_by & summarize not working properly? (name-collision with plyr)
P

5

80

I have a data frame that looks like this:

#df
ID  DRUG FED  AUC0t  Tmax   Cmax
1    1     0   100     5      20
2    1     1   200     6      25
3    0     1   NA      2      30 
4    0     0   150     6      65

And so on. I want to summarize some statistics for AUC, Tmax and Cmax by drug (DRUG) and fed status (FED). I use dplyr. For example, for the AUC:

CI90lo <- function(x) quantile(x, probs=0.05, na.rm=TRUE)
CI90hi <- function(x) quantile(x, probs=0.95, na.rm=TRUE)  

summary <- df %>%
             group_by(DRUG,FED) %>%
             summarize(mean=mean(AUC0t, na.rm=TRUE), 
                                 low = CI90lo(AUC0t), 
                                 high= CI90hi(AUC0t),
                                 min=min(AUC0t, na.rm=TRUE),
                                 max=max(AUC0t,na.rm=TRUE), 
                                 sd= sd(AUC0t, na.rm=TRUE))

However, the output is not grouped by DRUG and FED. It returns a single row with the statistics computed over all rows, rather than one row per combination of DRUG and FED.

Any idea why, and how can I make it do the right thing?

Petula answered 14/11, 2014 at 6:0 Comment(5)
Please check this link: https://mcmap.net/q/260674/-dplyr-issues-when-using-group_by-multiple-variables - Quixote
@Quixote Thanks a lot. I was actually happy with the dplyr package, but it looks like it is not reliable! - Petula
BTW, should you not label your functions as CI95hi and CI95lo, i.e. using 95 rather than 90? - Axilla
@Axilla I am using the 90% confidence interval. - Petula
This is actually a known issue with plyr + dplyr + occasionally other libraries (ggplot2 + xts). It also bit me and also took ages to debug. - Amadeo
S
202

I believe you've loaded plyr after dplyr, which is why you are getting an overall summary instead of a grouped summary.

This is what happens with plyr loaded last.

library(dplyr)
library(plyr)
df %>%
      group_by(DRUG,FED) %>%
      summarize(mean=mean(AUC0t, na.rm=TRUE), 
                low = CI90lo(AUC0t), 
                 high= CI90hi(AUC0t),
                 min=min(AUC0t, na.rm=TRUE),
                 max=max(AUC0t,na.rm=TRUE), 
                 sd= sd(AUC0t, na.rm=TRUE))

  mean low high min max sd
1  150 105  195 100 200 50

Now remove plyr and try again and you get the grouped summary.

detach(package:plyr)
df %>%
      group_by(DRUG,FED) %>%
      summarize(mean=mean(AUC0t, na.rm=TRUE), 
                low = CI90lo(AUC0t), 
                 high= CI90hi(AUC0t),
                 min=min(AUC0t, na.rm=TRUE),
                 max=max(AUC0t,na.rm=TRUE), 
                 sd= sd(AUC0t, na.rm=TRUE))

Source: local data frame [4 x 8]
Groups: DRUG

  DRUG FED mean low high min max  sd
1    0   0  150 150  150 150 150 NaN
2    0   1  NaN  NA   NA  NA  NA NaN
3    1   0  100 100  100 100 100 NaN
4    1   1  200 200  200 200 200 NaN
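
If you are not sure whether masking is the cause, you can ask R which package a bare `summarize` call currently resolves to. This is only a minimal sketch: it assumes dplyr is installed, and it uses base R's `environment()`, `environmentName()` and `find()` rather than anything specific to this answer.

```r
library(dplyr)

# Namespace that the bare name `summarize` currently resolves to.
# "dplyr" is what you want; "plyr" would mean plyr is masking it.
resolved <- environmentName(environment(summarize))
print(resolved)

# Every attached package (or the global environment) that defines
# `summarize`, in search-path order; a bare call uses the first entry.
print(find("summarize"))
```

If both packages really are needed in the same session, plyr's own startup message recommends loading plyr first and dplyr second, so that dplyr's versions end up earlier on the search path.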
Superfamily answered 14/11, 2014 at 15:15 Comment(1)
Worth mentioning that ggplot2 can have this effect too - presumably plyr is a dependency. - Baa
J
38

A variant of aosmith's answer that might help some folks out: tell R explicitly to use dplyr's functions by prefixing them with dplyr::. It is a good trick whenever one package interferes with another.

df %>%
      dplyr::group_by(DRUG,FED) %>%
      dplyr::summarize(mean=mean(AUC0t, na.rm=TRUE), 
                low = CI90lo(AUC0t), 
                 high= CI90hi(AUC0t),
                 min=min(AUC0t, na.rm=TRUE),
                 max=max(AUC0t,na.rm=TRUE), 
                 sd= sd(AUC0t, na.rm=TRUE))
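
If typing the prefix on every call gets tedious, one possible alternative is to pin the bare names to dplyr's versions once at the top of the script. The sketch below is an assumption-laden illustration, not part of the original answer: it rebuilds only the four rows shown in the question, and it uses the `.groups` argument, which needs dplyr 1.0.0 or later.

```r
library(dplyr)

# The four rows shown in the question (AUC0t only, for brevity)
df <- data.frame(
  ID    = 1:4,
  DRUG  = c(1, 1, 0, 0),
  FED   = c(0, 1, 1, 0),
  AUC0t = c(100, 200, NA, 150)
)

# Pin the bare names to dplyr's versions. These copies live in the global
# environment, which R searches before any attached package, so attaching
# plyr later in the session cannot mask them.
group_by  <- dplyr::group_by
summarize <- dplyr::summarize

res <- df %>%
  group_by(DRUG, FED) %>%
  summarize(mean = mean(AUC0t, na.rm = TRUE),
            n    = dplyr::n(),
            .groups = "drop")

print(res)
```

The result has one row per DRUG/FED combination, which is a quick way to confirm the grouping was respected.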
Java answered 2/2, 2018 at 18:35 Comment(1)
Disturbing that namespacing is seen as a trick in R XD - Prorogue
S
6

In addition to dplyr, users often load ggplot2 and, with it, ggpubr. ggpubr is another commonly used package that has a few incompatibilities with dplyr. As shown above, you can call the function with an explicit prefix (dplyr::function), but if that still does not work, as happened to me, detaching the package should be enough:

detach("package:ggpubr", unload = TRUE)

df %>%
  dplyr::group_by(DRUG,FED) %>%
  dplyr::summarize(mean=mean(AUC0t, na.rm=TRUE), 
            low = CI90lo(AUC0t), 
             high= CI90hi(AUC0t),
             min=min(AUC0t, na.rm=TRUE),
             max=max(AUC0t,na.rm=TRUE), 
             sd= sd(AUC0t, na.rm=TRUE))
Seawards answered 20/4, 2021 at 14:11 Comment(0)
S
3

Or you could consider using data.table

library(data.table)
setDT(df)  # set the data frame as data table
df[, list(mean = mean(AUC0t, na.rm=TRUE),
          low = CI90lo(AUC0t), 
          high = CI90hi(AUC0t), 
          min = as.double(min(AUC0t, na.rm=TRUE)),
          max = as.double(max(AUC0t, na.rm=TRUE)), 
          sd = sd(AUC0t, na.rm=TRUE)),
   by=list(DRUG, FED)]

#    DRUG FED mean low high min  max sd
# 1:    1   0  100 100  100 100  100 NA
# 2:    1   1  200 200  200 200  200 NA
# 3:    0   1  NaN  NA   NA Inf -Inf NA
# 4:    0   0  150 150  150 150  150 NA
# Warning messages:
#   1: In min(AUC0t, na.rm = TRUE) :
#   no non-missing arguments to min; returning Inf
# 2: In max(AUC0t, na.rm = TRUE) :
#   no non-missing arguments to max; returning -Inf
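
The Inf/-Inf values and the warnings come from the DRUG = 0, FED = 1 group, where every AUC0t value is NA, so nothing is left once `na.rm = TRUE` drops the missing values. One way to get NA there instead is a small guard. `safe_min` and `safe_max` below are hypothetical helper names, not functions from data.table or base R:

```r
# Hypothetical helpers: return NA (without a warning) when a group has no
# non-missing values, otherwise behave like min()/max() with na.rm = TRUE.
safe_min <- function(x) if (all(is.na(x))) NA_real_ else min(x, na.rm = TRUE)
safe_max <- function(x) if (all(is.na(x))) NA_real_ else max(x, na.rm = TRUE)

all_missing <- safe_min(c(NA, NA))         # group with only NA values
some_values <- safe_min(c(100, NA, 150))   # group with real values
print(all_missing)
print(some_values)
```

These can be dropped in wherever `min(AUC0t, na.rm=TRUE)` and `max(AUC0t, na.rm=TRUE)` appear above.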
Synergistic answered 14/11, 2014 at 6:49 Comment(1)
Thanks a lot. That would work too; however, I used ddply instead. ddply looks to be more reliable than dplyr. - Petula
E
0

Try sqldf; it is an easy-to-learn way to group data. Below is an example along the lines of what you need. The sqldf library is very helpful for all kinds of grouping on data samples.

install.packages("sqldf")
library(sqldf)
dat1 <- sqldf("select x,y,
            y/sum(y) as Z
            from dat
            group by x")
Emia answered 21/8, 2019 at 7:47 Comment(0)
