Summarize all group values and a conditional subset in the same call
Asked Answered
D

4

78

I'll illustrate my question with an example.

Sample data:

 df <- data.frame(ID = c(1, 1, 2, 2, 3, 5), A = c("foo", "bar", "foo", "foo", "bar", "bar"), B =     c(1, 5, 7, 23, 54, 202))

df
  ID   A   B
1  1 foo   1
2  1 bar   5
3  2 foo   7
4  2 foo  23
5  3 bar  54
6  5 bar 202

What I want to do is to summarize, by ID, the sum of B and the sum of B when A is "foo". I can do this in a couple steps like:

require(magrittr)
require(dplyr)

df1 <- df %>%
  group_by(ID) %>%
  summarize(sumB = sum(B))

df2 <- df %>%
  filter(A == "foo") %>%
  group_by(ID) %>%
  summarize(sumBfoo = sum(B))

left_join(df1, df2)

  ID sumB sumBfoo
1  1    6       1
2  2   30      30
3  3   54      NA
4  5  202      NA

However, I'm looking for a more elegant/faster way, as I'm dealing with 10gb+ of out-of-memory data in sqlite.

require(sqldf)
my_db <- src_sqlite("my_db.sqlite3", create = T)
df_sqlite <- copy_to(my_db, df)

I thought of using mutate to define a new Bfoo column:

df_sqlite %>%
  mutate(Bfoo = ifelse(A=="foo", B, 0))

Unfortunately, this doesn't work on the database end of things.

Error in sqliteExecStatement(conn, statement, ...) : 
  RS-DBI driver: (error in statement: no such function: IFELSE)
Dissolution answered 7/5, 2014 at 21:33 Comment(8)
I believe the problem is that you are trying to mix character and numeric in Bfoo which is not possibleAdriell
@beginneR it's all numeric and the code runs fine locally...Dissolution
What versions of R and dplry are you using? Your code ran fine for me locally as well (after I changed %>% to %.%).Centuple
@Centuple yeah it's supposed to run "locally" (by that I meant on a data.frame and not in the sqlite database). I'm trying to figure out how to best run it in sqlite, which doesn't recognize "ifelse"Dissolution
@pearpies I don't understand. Is this running in R? or is this running in sqllite? How is it getting to the database?Centuple
sorry my OP is a bit confusing. will editDissolution
Try using if(A=="foo") B else 0 - dplyr will try to convert that to a SQL case statement, which might work for you.Whomp
In case anyone is looking to summarize based on mathematical conditions (instead of matching strings): #59199773Mushroom
D
38

Writing up @hadley's comment as an answer

df_sqlite %>%
  group_by(ID) %>%
  mutate(Bfoo = if(A=="foo") B else 0) %>%
  summarize(sumB = sum(B),
            sumBfoo = sum(Bfoo)) %>%
  collect
Dissolution answered 7/5, 2014 at 22:56 Comment(2)
what is the function of dplyr::collect here?Laveen
collect forces the evaluation of the verbs prior to itDissolution
M
138

You can do both sums in a single dplyr statement:

df1 <- df %>%
  group_by(ID) %>%
  summarize(sumB = sum(B),
            sumBfoo = sum(B[A=="foo"]))

And here is a data.table version:

library(data.table)

dt = setDT(df) 

dt1 = dt[ , .(sumB = sum(B),
              sumBfoo = sum(B[A=="foo"])), 
          by = ID]

dt1
   ID sumB sumBfoo
1:  1    6       1
2:  2   30      30
3:  3   54       0
4:  5  202       0
Micrometeorite answered 7/5, 2014 at 22:37 Comment(6)
this would work if I could fit everything into memory. i'll definitely keep it in mind. problem is sql doesn't recognize [. sorry i wasn't clear in the question!Dissolution
The dplyr database vignette might be helpful for future reference (in case you haven't already seen it): cran.rstudio.com/web/packages/dplyr/vignettes/databases.htmlMicrometeorite
How could I miss that in the documentation?! Imho one of dplyr's most useful functionalities for dealing with data, especially when working with tidy data.Veneer
great answer! Have not come across this kind of subsetting before : df %>% summarize(function(colname[conditional_colname]))Steinmetz
Wow, never knew u could subset like that. Thank you very much!Vamp
@Micrometeorite I'm aware of how old this answer is, but do you happen to know the data.table alternative for the conditional A == "foo" that you used for sumBfoo?Tirrell
D
38

Writing up @hadley's comment as an answer

df_sqlite %>%
  group_by(ID) %>%
  mutate(Bfoo = if(A=="foo") B else 0) %>%
  summarize(sumB = sum(B),
            sumBfoo = sum(Bfoo)) %>%
  collect
Dissolution answered 7/5, 2014 at 22:56 Comment(2)
what is the function of dplyr::collect here?Laveen
collect forces the evaluation of the verbs prior to itDissolution
I
16

If you want to do counting instead of summarizing, then the answer is somewhat different. The change in code is small, especially in the conditional counting part.

df1 <- df %>%
    group_by(ID) %>%
    summarize(countB = n(),
              countBfoo = sum(A=="foo"))

df1
Source: local data frame [4 x 3]

  ID countB countBfoo
1  1      2         1
2  2      2         2
3  3      1         0
4  5      1         0
Identic answered 8/12, 2014 at 13:11 Comment(0)
I
-1

If you wanted to count the rows, instead of summing them, can you pass a variable to the function:

    df1 <- df %>%
group_by(ID) %>%
summarize(RowCountB = n(),
          RowCountBfoo = n(A=="foo"))

I get an error both with n() and nrow().

Ihab answered 21/4, 2019 at 9:36 Comment(1)
Is this an answer or another question?Breeches

© 2022 - 2024 — McMap. All rights reserved.