dplyr issues when using group_by(multiple variables)
G

5

58

I want to start using dplyr in place of ddply but I can't get a handle on how it works (I've read the documentation).

For example, why when I try to mutate() something does the "group_by" function not work as it's supposed to?

Looking at mtcars:

library(dplyr)

Say I make a data.frame which is a summary of mtcars, grouped by "cyl" and "gear":

df1 <- mtcars %>%
            group_by(cyl, gear) %>%
            summarise(
                newvar = sum(wt)
            )

Then say I want to further summarise this data frame. With ddply, it'd be straightforward, but when I try to do it with dplyr, it's not actually "grouping by":

df2 <- df1 %>%
            group_by(cyl) %>%
            mutate(
                newvar2 = newvar + 5
            )

Still yields an ungrouped output:

  cyl gear newvar newvar2
1   6    3  6.675  11.675
2   4    4 19.025  24.025
3   6    4 12.375  17.375
4   6    5  2.770   7.770
5   4    3  2.465   7.465
6   8    3 49.249  54.249
7   4    5  3.653   8.653
8   8    5  6.740  11.740

Am I doing something wrong with the syntax?


Edit:

If I were to do this with plyr and ddply:

df1 <- ddply(mtcars, .(cyl, gear), summarise, newvar = sum(wt))

and then to get the second df:

df2 <- ddply(df1, .(cyl), summarise, newvar2 = sum(newvar) + 5)

But that same approach, with sum(newvar) + 5 in the summarise() function doesn't work with dplyr...

Grindstone answered 8/2, 2014 at 23:50 Comment(5)
Can you give us the equivalent plyr code with ddply please?Southerner
What do you mean by "ungrouped"? Were you expecting one row per group? Or were you expecting all rows from the same group to be below each other?Tying
I'd expect just three rows for the second df (one for each cyl), as it looks with the ddply arguments that I just added in the edits... I assume this is just a matter of adding one argument somewhere that I'm missing?Grindstone
Then I think you are confusing mutate and summarise.Tying
Ah, so I am. Will summarise be as efficient as mutate if I want to summarise a dataframe while also adding new variables?Grindstone
S
45

Taking Dickoa's answer one step further: as Hadley says, "summarise peels off a single layer of grouping". It peels off the grouping variables in the reverse of the order in which you applied them (here gear goes first, leaving the data grouped by cyl), so you can just use

mtcars %>%
 group_by(cyl, gear) %>%
 summarise(newvar = sum(wt)) %>%
 summarise(newvar2 = sum(newvar) + 5)

Note that this will give a different answer if you use group_by(gear, cyl) in the second line, because cyl is then peeled off first and the final result has one row per gear.
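To make the "peeling" visible, here is a minimal sketch that inspects the remaining grouping with group_vars() after one summarise(). It assumes a recent dplyr (1.0.0 or later), where the `.groups` argument exists; `.groups = "drop_last"` only spells out the behaviour described above. The object names (cyl_first, gear_first, and so on) are made up for illustration.

```r
library(dplyr)

# Grouped by cyl, then gear: one summarise() peels off gear, leaving cyl.
cyl_first <- mtcars %>%
  group_by(cyl, gear) %>%
  summarise(newvar = sum(wt), .groups = "drop_last")
group_vars(cyl_first)

# Grouped by gear, then cyl: one summarise() peels off cyl, leaving gear.
gear_first <- mtcars %>%
  group_by(gear, cyl) %>%
  summarise(newvar = sum(wt), .groups = "drop_last")
group_vars(gear_first)

# The second summarise() therefore collapses to a different variable
# depending on the order used in group_by().
cyl_totals  <- cyl_first  %>% summarise(newvar2 = sum(newvar) + 5)
gear_totals <- gear_first %>% summarise(newvar2 = sum(newvar) + 5)
names(cyl_totals)
names(gear_totals)
```

The first result has one row per cyl and the second one row per gear, which is why the order inside group_by() matters when you rely on the implicit peeling.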

And to get your first attempt working:

df1 <- mtcars %>%
 group_by(cyl, gear) %>%
 summarise(newvar = sum(wt))

df2 <- df1 %>%
 group_by(cyl) %>%
 summarise(newvar2 = sum(newvar)+5)
Scheel answered 9/2, 2014 at 7:1 Comment(2)
I'd still like to get better information on Hadley's "peels off" metaphor. Does anyone have some references or other posted answers regarding it?Seacock
cran.rstudio.com/web/packages/dplyr/vignettes/introduction.html, see section containing the phrase: "each summary peels off one level of the grouping"Ciaphus
D
82

I had a similar problem. I found that simply detaching plyr solved it:

detach(package:plyr)    
library(dplyr)
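A slightly more defensive variant, as a sketch: detach(package:plyr) raises an error when plyr is not attached, so the call can be guarded by checking the search path first. This assumes no other attached package depends on plyr, in which case detach() would refuse. The variable plyr_attached is just an illustrative name.

```r
# Detach plyr only if it is currently attached, so the call cannot fail
# just because plyr was never loaded in this session.
if ("package:plyr" %in% search()) {
  detach("package:plyr", character.only = TRUE)
}
library(dplyr)

# plyr should no longer be on the search path at this point.
plyr_attached <- "package:plyr" %in% search()
plyr_attached
```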
Dorton answered 14/8, 2014 at 16:45 Comment(3)
Been sitting here pulling my hair out for the last hour and a half trying to understand why dplyr was simply ignoring my groupings. Glad to know I'm not just crazy.Nutrition
I couldn't figure out why code ran fine once using summarize but not upon visiting it later. Indeed, I'd added plyr after loading dplyr. This is why. Not sure if it's a recent addition, but I caught this recently when loading the two: You have loaded plyr after dplyr - this is likely to cause problems. If you need functions from both plyr and dplyr, please load plyr first, then dplyr: library(plyr); library(dplyr).Devland
This happens often with dplyr methods being masked. A general solution is to explicitly reference dplyr's version of the method using dplyr::summarise(...).Flattie
S
11

If you translate your plyr code into dplyr using summarise instead of mutate you get the same results.

library(plyr)
df1 <- ddply(mtcars, .(cyl, gear), summarise, newvar = sum(wt))
df2 <- ddply(df1, .(cyl), summarise, newvar2 = sum(newvar) + 5)
df2
##   cyl newvar2
## 1   4  30.143
## 2   6  26.820
## 3   8  60.989

detach(package:plyr)    
library(dplyr)
mtcars %>%
    group_by(cyl, gear) %>%
    summarise(newvar = sum(wt)) %>%
    group_by(cyl) %>%
    summarise(newvar2 = sum(newvar) + 5)
##   cyl newvar2
## 1   4  30.143
## 2   8  60.989
## 3   6  26.820

EDIT

Since summarise drops the last group (gear) you can skip the second group_by (see @hadley comment below)

library(dplyr)
mtcars %>%
    group_by(cyl, gear) %>%
    summarise(newvar = sum(wt)) %>%
    summarise(newvar2 = sum(newvar) + 5)
##   cyl newvar2
## 1   4  30.143
## 2   8  60.989
## 3   6  26.820
Southerner answered 9/2, 2014 at 0:28 Comment(4)
So the second "group_by()" and "summarise()" calls overwrite the first ones?Grindstone
Yes, and you can also use regroup to enforce that.Southerner
You don't need the second group_by() here because summarise automatically drops the last group (the group it collapsed).Cherin
If you don't want to detach plyr for some reason, you can always just specify dplyr:: in front of the group_by and summarize functions.Pharyngology
U
6

Detaching plyr is one way to solve the problem so you can use dplyr functions as desired... but what if you need other functions from plyr to complete other tasks in your code?

(In this example, I've got both dplyr and plyr libraries loaded)

Suppose we have a simple data.frame and we want to compute the groupwise sum of the variable value, grouped by the levels of gname:

> dx<-data.frame(gname=c(1,1,1,2,2,2,3,3,3), value = c(2,2,2,4,4,4,5,6,7))
> dx
  gname value
1     1     2
2     1     2
3     1     2
4     2     4
5     2     4
6     2     4
7     3     5
8     3     6
9     3     7

But when we try to use what we believe will produce a dplyr grouped sum, here's what happens:

dx %>% group_by(gname) %>% mutate(mysum=sum(value))
Source: local data frame [9 x 3]
Groups: gname

  gname value mysum
1     1     2    36
2     1     2    36
3     1     2    36
4     2     4    36
5     2     4    36
6     2     4    36
7     3     5    36
8     3     6    36
9     3     7    36

It doesn't give us the desired answer: every row gets the overall total of 36. The likely cause is that plyr was loaded after dplyr, so plyr's mutate masks dplyr's version and ignores the grouping. We could detach plyr, but another way is to call the dplyr versions of group_by and mutate explicitly:

dx %>% dplyr::group_by(gname) %>% dplyr::mutate(mysum=sum(value))
Source: local data frame [9 x 3]
Groups: gname

  gname value mysum
1     1     2     6
2     1     2     6
3     1     2     6
4     2     4    12
5     2     4    12
6     2     4    12
7     3     5    18
8     3     6    18
9     3     7    18

Now we see that this works as expected.
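If you are unsure whether masking is happening, one way to check (a sketch, not the only way) is to ask which namespace the unqualified name currently resolves to. The names mutate_owner, summarise_owner and res are made up for illustration.

```r
library(dplyr)

# Which package does the unqualified name resolve to right now?
# "dplyr" is what you want; "plyr" would mean plyr is masking dplyr.
mutate_owner    <- environmentName(environment(mutate))
summarise_owner <- environmentName(environment(summarise))
mutate_owner
summarise_owner

# Namespace-qualified calls use dplyr's versions regardless of load order.
dx <- data.frame(gname = c(1, 1, 1, 2, 2, 2, 3, 3, 3),
                 value = c(2, 2, 2, 4, 4, 4, 5, 6, 7))
res <- dx %>%
  dplyr::group_by(gname) %>%
  dplyr::mutate(mysum = sum(value)) %>%
  dplyr::ungroup()
res$mysum
```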

Uird answered 27/2, 2015 at 2:14 Comment(0)
P
5

dplyr is working as you should expect in your example. mutate, as you specified it, just adds 5 to each value of newvar as it creates newvar2. The result looks the same whether you group or not. If, however, you specify something that differs by group, you will get something different. For example:

df1 %>%
            group_by(cyl) %>%
            mutate(
                newvar2 = newvar + mean(cyl)
            )
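To see the mutate/summarise difference from the question side by side, here is a small sketch, assuming a recent dplyr (1.0.0 or later, for the `.groups` argument). The names mutated and summarised are illustrative only.

```r
library(dplyr)

# Rebuild the question's df1: one row per cyl/gear combination.
df1 <- mtcars %>%
  group_by(cyl, gear) %>%
  summarise(newvar = sum(wt), .groups = "drop")

# mutate() keeps every input row and adds the per-group value to each one.
mutated <- df1 %>%
  group_by(cyl) %>%
  mutate(newvar2 = sum(newvar) + 5)

# summarise() collapses each group to a single row.
summarised <- df1 %>%
  group_by(cyl) %>%
  summarise(newvar2 = sum(newvar) + 5)

nrow(mutated)     # same number of rows as df1
nrow(summarised)  # one row per cyl
```

The grouped mutate() still respects the groups; it just never reduces the number of rows, which is what the ddply(..., summarise, ...) call in the question does.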
Prefect answered 9/2, 2014 at 0:16 Comment(0)
