dplyr idiom for summarize() a filtered-group-by, and also replace any NAs due to missing rows

Asked 2/5, 2014 at 23:31 Answered 3/5, 2014 at 8:12

I am computing a dplyr::summarize across a dataframe of sales data. I do a group-by (S,D,Y), then within each group, compute medians and means for weeks 5..43, then merge those back into the parent df. Variable X is sales. X is never NA (i.e. there are no explicit NAs anywhere in df), but if there is no data (as in, no sales) for that S,D,Y and set of weeks, there will simply be no row with those values in df (take it that means zero sales for that particular set of parameters). In other words, impute X=0 in any structurally missing rows (but I hope I don't need to melt/cast the original df, to avoid bloat. Similar to cast(fill....,add.missing=T) or caret::preProcess()).

Two questions about my code idiom:

Is it better to use summarize than dplyr::filter, because filter physically drops rows so I have to assign the results to df.tmp then left-join it back to the original df (as below)? Also, big subsetting expressions repeated on every single line of summarize computations make the code harder to read. Should I worry (or not) about caching the rows or logical indices of the subsetting operation, in the general case where I might be computing say n=20 new summary variables?
Not all combinations of S,D,Y-groups and filter (for those weeks) have rows, so how to get the summarize to replace NA on any missing rows? Currently I do as below.

Sorry both the code and dataset are proprietary, but here's the code idiom, and below is code you should run first to generate sample-data:

# Compute median, mean of X across wks 5..43, for that set of S,D,Y-values
# Issue a) filter() or repeatedly use subset() within each calculation?
df.tmp <- df %.% group_by(S,D,Y) %.% filter(Week>=5 & Week<=43) %.%
  summarize(ysd_med543_X  = median(X),
            ysd_mean543_X = mean(X)
           ) %.% ungroup()

# Issue b) how to replace NAs in groups where the group_by-and-filter gave empty output?
# can you merge this code with the summarize above?
df <- left_join(df, df.tmp, copy=F)
newcols <- match(c('ysd_mean543_X','ysd_med543_X'), names(df))
df[!complete.cases(df[,newcols]), newcols] <- c(0.0,0.0)

and run this first to generate sample-data:

set.seed(1234)

rep_vector <- function(vv, n) {
  unlist(as.vector(lapply(vv, function(...) {rep(...,n)} )))
}

n=7
m=3
df = data.frame(S = rep_vector(10:12, n), D = 20:26,
                Y = rep_vector(2005:2007, n),
                Week = round(52*runif(m*n)),
                X = 4e4*runif(m*n) + 1e4 )

# Now drop some rows, to model structurally missing rows
I <- sort(sample(1:nrow(df),0.6*nrow(df)))
df = df[I,]

require(dplyr)

Dakar answered 2/5, 2014 at 23:31 Comment(4)

"You're not going to get anything more". Sorry, but it shouldn't be hard to cook-up some random data. – Uella 3/5, 2014 at 0:2

@Arun: due to popular demand above is sample data... not that it was strictly needed... now may I expect some answers? – Dakar 3/5, 2014 at 2:59

I guess the root-cause is dplyr summarise: Equivalent of “.drop=FALSE” to keep groups with zero length in output ; github.com/hadley/dplyr/issues/341 . But I need a workaround, today. – Dakar 3/5, 2014 at 3:11

To the downvoters: if you really want to be annoying, the code is from a live data-science competition whose rules prevent sharing code. To be clear, I'm not sharing any code here, because this snippet is very heavily sanitized. This question is only about the general dplyr idiom. It's really aggravating me to have two people downvote, but can't be bothered doing anything useful once I do provide sample data. – Dakar 3/5, 2014 at 8:6

I don't think this has anything to do with the feature you've linked under comments (because IIUC that feature has to do with unused factor levels). Once you filter your data, IMO summarise should not (or rather can't?) be including them in the results (with the exception of factors). You should clarify this with the developers on their project page.

I'm by no means a dplyr expert, but I think, firstly, it'd be better to filter first followed by group_by + summarise. Else, you'll be filtering for each group, which is unnecessary. That is:

df.tmp <- df %.% filter(Week>=5 & Week<=43) %.% group_by(S,D,Y) %.% ...

This is just so that you're aware of it for any future cases.

IMO, it's better to use mutate here instead of summarise, as it'll remove the need for left_join, IIUC. That is:

df.tmp <- df %.% group_by(S,D,Y) %.% mutate(
             md_X = median(X[Week >=5 & Week <= 43]), 
             mn_X = mean(X[Week >=5 & Week <= 43]))

Here, still we've the issue of replacing the NA/NaN. There's no easy/direct way to sub-assign here. So, you'll have to use ifelse, once again IIUC. But that'd be a little nicer if mutate supports expressions.

What I've in mind is something like:

df.tmp <- df %.% group_by(S,D,Y) %.% mutate(
              { tmp = Week >= 5 & Week <= 43;
                md_X = ifelse(length(tmp), median(X[tmp]), 0), 
                md_Y = ifelse(length(tmp), mean(X[tmp]), 0)
              })

So, we'll have to workaround in this manner probably:

df.tmp = df %.% group_by(S,D,Y) %.% mutate(tmp = Week >=5 & Week <= 43)
df.tmp %.% mutate(md_X = ifelse(tmp[1L], median(X), 0), 
                  mn_X = ifelse(tmp[1L], mean(X), 0))

Or to put things together:

df %.% group_by(S,D,Y) %.% mutate(tmp = Week >=5 & Week <= 43, 
       md_X = ifelse(tmp[1L], median(X), 0), 
       mn_X = ifelse(tmp[1L], median(X), 0)) 

#     S  D    Y Week        X   tmp     md_X     mn_X
# 1  10 20 2005    6 22107.73  TRUE 22107.73 22107.73
# 2  10 23 2005   32 18751.98  TRUE 18751.98 18751.98
# 3  10 25 2005   33 31027.90  TRUE 31027.90 31027.90
# 4  10 26 2005    0 46586.33 FALSE     0.00     0.00
# 5  11 20 2006   12 43253.80  TRUE 43253.80 43253.80
# 6  11 22 2006   27 28243.66  TRUE 28243.66 28243.66
# 7  11 23 2006   36 20607.47  TRUE 20607.47 20607.47
# 8  11 24 2006   28 22186.89  TRUE 22186.89 22186.89
# 9  11 25 2006   15 30292.27  TRUE 30292.27 30292.27
# 10 12 20 2007   15 40386.83  TRUE 40386.83 40386.83
# 11 12 21 2007   44 18049.92 FALSE     0.00     0.00
# 12 12 26 2007   16 35856.24  TRUE 35856.24 35856.24

which doesn't require df.tmp.

HTH

Uella answered 3/5, 2014 at 8:12 Comment(12)

I want dplyr not data.table, thanks. That's why its tagged dplyr. – Dakar 3/5, 2014 at 8:13

@smci, please read the entire answer before commenting. It just helps me to think in data.table. I've provided a dplyr solution. – Uella 3/5, 2014 at 8:13

I already read the whole answer in full. And I was typing point-by-point responses to your comments. It's really aggravating how you keep talking down to me. Can you fix your attitude? Please stop assuming incorrect things. – Dakar 3/5, 2014 at 8:15

You mean you're offended that I used data.table to get to the dplyr solution? Many things seem to be aggravating you. – Uella 3/5, 2014 at 8:18

I've edited out the data.table part. Hopefully, that'll calm you down at least a bit. Ciao. – Uella 3/5, 2014 at 8:23

Issue a) asked you to comment on the filter vs subset tradeoff ( or logical indexing as you did). I'm willing to take some performance hit on repeatedly doing subset() inside each summarize(... computation, in return for avoiding having to assign to a temporary df.tmp then left_join back to the parent df. The mutate you do with logical indexing is nice, I think we can avoid the need for df.tmp by creating tmp_X directly on df, then processing it, and finally dropping it with select(-tmp_X). – Dakar 3/5, 2014 at 8:26

Issue b) is still not solved, because this doesn't eliminate the temporary df.tmp, summarize() still blows up on empty groups:

summarize(empty_group = (length(X)==0 || length(X[Week>=5 & Week<=43,])==0), ysd_med543_X  = ifelse(!empty_group, median(X[Week>=5 & Week<=43,]), 0)...

gives Error in summarise_impl(.data, named_dots(...), environment()) : attempt to use zero-length variable name – Dakar 3/5, 2014 at 8:53

And I believe that's the same issue as the linked dplyr issue: dplyr summarise: Equivalent of “.drop=FALSE” to keep groups with zero length in output ; [github.com/hadley/dplyr/issues/341](github.com/hadley/dplyr/issues/341) . Causing that Error:... attempt to use zero-length variable name. – Dakar 3/5, 2014 at 8:56

But your mutate() solves things by avoiding summarize() entirely. Thanks! – Dakar 3/5, 2014 at 9:8

and don't forget to select(-tmp) – Dakar 3/5, 2014 at 9:13

md_X = median(tmp * X) and similarly for mean could be used for a small reduction in code size. md_X = tmp[1L] * median(X) would be another approach. – Vincentia 3/5, 2014 at 12:40

Guys, it's nearly perfect, but I see intermittent weirdness if I directly assign the result df <- df %.% group_by(S,D,Y) %.% mutate(...) %.% ungroup() Specifically, sometimes the new columns are all correct, but at other times they're all zero! How does that happen?? – Dakar 4/5, 2014 at 1:19

Hot tags

Godot Unity Godot Help Programming Godot 4.X GUI GDScript 3D 2D Physics CSharp Godot 3.X VR XR Projects C++

Recommended topics

Hot tags