row-wise first/last occurrences from column series in data.table

As a follow-up to this oldie but goodie: efficient row-wise operations on a data.table

I have some data that (unfortunately) look like:

library('data.table')
set.seed(1234)
m <- 5
n <- 7
rb <- function() runif(m,1000,2000) * rbinom(m,1,0.5)   
series_col_nms <- paste0('YearNo',1:n)    
rev <- data.table(cust_id = paste0('CustNo',1:m), 
                other_stuff = sample(letters,m, replace=TRUE))
for(col in series_col_nms){
  set(rev, j=col, value=rb())
}
setkey(rev, cust_id)

One row per customer with various columns including yearly revenue for year 1, 2, ...

I would like to get the year indices for the first and last years with any revenue from each customer.

I can produce the desired result, but with a somewhat hacky join:

years_active <- rev[, which(.SD>0), .SDcols = series_col_nms, 
                    keyby=cust_id][, .(min_year_active = min(V1),
                                       max_year_active = max(V1)), keyby=cust_id]
years_active[rev]

These attempts to get the minimum index fail:

rev[, apply(.SD, 1, function(x) min(which(x>0))), .SDcols=series_col_nms, by=cust_id] # returns data type error    
rev[, do.call(pmin, lapply(.SD, function(x) which(x>0))), .SDcols=series_col_nms, by=cust_id] # returns empty

What is the data.table way to do this?

Usual answered 30/11, 2015 at 21:38

When you want to operate on many columns by row, the common practice is to melt your data set first and then operate on a single column.

In your case, a relatively straightforward solution could be something like

res <- melt(rev, id = 1:2)[, 
         as.list({
            temp <- value != 0
            if (any(temp)) range(which(temp)) else rep(NA_integer_, 2)
         }), 
        by = cust_id]

rev[, c("Min", "Max") := res[, .(V1, V2)]]
rev
#    cust_id other_stuff  YearNo1 YearNo2  YearNo3  YearNo4  YearNo5  YearNo6  YearNo7 Min Max
# 1: CustNo1           c 1640.311       0    0.000 1759.671    0.000 1503.933    0.000   1   6
# 2: CustNo2           q 1009.496       0    0.000 1201.248    0.000    0.000 1308.095   1   7
# 3: CustNo3           p    0.000       0    0.000    0.000 1484.991    0.000    0.000   5   5
# 4: CustNo4           q 1666.084       0 1831.345 1992.150 1243.929    0.000 1051.647   1   7
# 5: CustNo5           w    0.000       0    0.000    0.000    0.000    0.000    0.000  NA  NA
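
The positional assignment above works because both res and rev end up ordered by cust_id (rev is keyed, and the grouped result is built in that order). If you would rather make the alignment explicit, an update join should do the same thing; this is my own variation, not part of the original answer:

rev[res, on = "cust_id", c("Min", "Max") := .(i.V1, i.V2)]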

A cleaner version but with warnings could be

melt(rev, id = 1:2)[, as.list(as.integer(range(which(value != 0)))), by = cust_id]
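
The warnings come from CustNo5, the customer with no revenue at all: for that group which(value != 0) is empty, range() on an empty vector returns c(Inf, -Inf) with warnings, and as.integer() then coerces those to NA with a further warning. A quick illustration of the mechanism (my own, not from the answer):

range(which(c(0, 0, 0) != 0))               # Inf -Inf, with two warnings
as.integer(range(which(c(0, 0, 0) != 0)))   # NA NA, plus a coercion warning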
Lasley answered 30/11, 2015 at 21:56
I wouldn't call the version with warnings a cleaner one, but a dirty one ;) – Attract

Reshaping

I would store the data in long format:

mrev = melt(rev, 
  id=c("cust_id","other_stuff"), 
  variable.name="YearNo", 
  value.name="revenue")[revenue > 0]

You lose customer 5 with the revenue > 0 condition, but I doubt that matters.

Then collect aggregate statistics as needed:

mrev[ , list(first = YearNo[1], last = YearNo[.N]), by=cust_id]

#    cust_id   first    last
# 1: CustNo1 YearNo1 YearNo6
# 2: CustNo2 YearNo1 YearNo7
# 3: CustNo4 YearNo1 YearNo7
# 4: CustNo3 YearNo5 YearNo5

Parsing out the numbers from the strings you've been using is simple, of course.
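
For instance, a minimal sketch of that parsing step (my own addition; the YearNo column produced by melt is a factor with levels like "YearNo1", so strip the prefix and convert):

firstlast <- mrev[ , list(first = YearNo[1], last = YearNo[.N]), by = cust_id]
firstlast[, `:=`(first = as.integer(sub("YearNo", "", first)),
                 last  = as.integer(sub("YearNo", "", last)))]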


max.col

I think this is a bad kludge, but...

max.col(rev[,!c("cust_id","other_stuff"),with=FALSE] > 0, "first")
max.col(rev[,!c("cust_id","other_stuff"),with=FALSE] > 0, "last")

You'd have to go back and fill in the special case of all zeros (customer 5) separately.
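
A sketch of that fix-up, assuming the same setup as above (my own addition, not part of the original answer): flag the all-zero rows with rowSums() and overwrite their indices with NA.

has_rev  <- rev[, !c("cust_id", "other_stuff"), with = FALSE] > 0   # logical matrix
first_yr <- max.col(has_rev, "first")
last_yr  <- max.col(has_rev, "last")
first_yr[rowSums(has_rev) == 0] <- NA   # e.g. CustNo5, who has no revenue in any year
last_yr[rowSums(has_rev) == 0]  <- NA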

Bluebonnet answered 30/11, 2015 at 21:57
Thanks Frank and @DavidArenburg -- this pretty much confirms my original approach, although maybe explicit melting is faster -- do you see any way to use row-wise operations (pmin, .SD, etc.) for this? – Usual
@Usual I think that if speed is a concern, you should just stick to melted/long data, which data.table is better with. (This also helps you avoid awkwardly putting data into column names.) pmin happens to be a hack available for the question linked, but I wouldn't bet on such shortcuts generally being available. – Bluebonnet
Thanks (and +1) -- if I could figure out a row-wise solution I would have benchmarked it, but good to know I am on the right track. I am going to accept @DavidArenburg's solution for his idea with range() (otherwise it's a tie). – Usual
@Usual Row-wise solutions are generally a bad idea in R unless fully vectorized, simply because R was designed to operate on vectors rather than on rows. melt is the natural transformation from operating on rows to operating on vectors, and the data.table devs developed melt to be fully optimized, so this transformation will be seamless even for huge data sets. – Lasley
@DavidArenburg -- after doing a little testing it looks like pmin is actually pretty fast. Let me know if I missed something. Cheers – Usual
pmin is vectorized and thus isn't doing by-row operations. Please reread my comment. In your case there was no straightforward pmin solution that we could identify, so row-wise operations such as apply(df, 1, ...) would be the worst solution possible. Also, pmin is specific to finding a minimum and that's all. By using melt first, you can apply any function afterwards, so this is a general solution compared to a specific hack. – Lasley
Btw, if you're a Hadley fan, you may want to read this too. – Lasley
@Usual I second David's recommendation of reading Hadley on "tidy data". The idea is not that melting is the fastest way to create this particular pair of columns; it is that working with melted data will make your subsequent operations (which presumably go beyond just making a single column) faster, in both runtime and coding (since you won't have to dig up clever tricks/hacks for each one). The first sentence of my answer is to be taken literally: I would create mrev and rm(rev). – Bluebonnet
@DavidArenburg thanks. I just signed up for DataCamp; hopefully after completing the data.table course there I will post more informed questions. – Usual
