Apply different functions to different columns programmatically in data.table R
Asked Answered
A

2

6

I need to programmatically apply different functions to different columns and group by, using data.table.

If the columns and functions were known, I would do like this:

library(data.table)
DT = data.table(id = rep(letters[1:3], each=3),
                v1 = rep(c(2, 3, 4), each=3),
                v2 = rep(c(5, 10, 15), each=3))
DT
#>    id v1 v2
#> 1:  a  2  5
#> 2:  a  2  5
#> 3:  a  2  5
#> 4:  b  3 10
#> 5:  b  3 10
#> 6:  b  3 10
#> 7:  c  4 15
#> 8:  c  4 15
#> 9:  c  4 15
DT[, .(v1=mean(v1), v2=sum(v2)), keyby=.(id)]
#>    id v1 v2
#> 1:  a  2 15
#> 2:  b  3 30
#> 3:  c  4 45

But I want to do this by passing the column names and their specific function:

aggregate_functions = list(v1=mean, v2=sum)
col_selection = c('v1', 'v2')

I wrote something like this by I can't figure out a way of passing the column name to lapply:

DT[, lapply(.SD, 
            aggregate_functions[[col_name]] # some way of selecting the right function from aggregate_functions
            ), 
   .SDcols = col_selection, 
   by=id]

I have also tried with meltand dcast, but the latter applies all the functions to all the columns:

library(data.table)
DT = data.table(id = rep(letters[1:3], each=3),
                v1 = rep(c(2, 3, 4), each=3),
                v2 = rep(c(5, 10, 15), each=3))
DTm = melt(DT, meaure.vars=col_selection, id.vars='id')
DTm
#>     id variable value
#>  1:  a       v1     2
#>  2:  a       v1     2
#>  3:  a       v1     2
#>  4:  b       v1     3
#>  5:  b       v1     3
#>  6:  b       v1     3
#>  7:  c       v1     4
#>  8:  c       v1     4
#>  9:  c       v1     4
#> 10:  a       v2     5
#> 11:  a       v2     5
#> 12:  a       v2     5
#> 13:  b       v2    10
#> 14:  b       v2    10
#> 15:  b       v2    10
#> 16:  c       v2    15
#> 17:  c       v2    15
#> 18:  c       v2    15
DTc = dcast(DTm, id~variable, fun.aggregate=list(sum, mean))
DTc
#>    id value_sum_v1 value_sum_v2 value_mean_v1 value_mean_v2
#> 1:  a            6           15             2             5
#> 2:  b            9           30             3            10
#> 3:  c           12           45             4            15

I could programmatically select and rename the relevant columns (3 and 4 in this case) but it doesn't look like an efficient approach.

Of course I could have a loop doing the job and merging the results, but I am looking for a data.table way.

Thank you for your answer and thank you to the team at data.table.

Created on 2019-11-26 by the reprex package (v0.3.0)

Abstractionist answered 26/11, 2019 at 21:39 Comment(0)
S
1

An option is to use mapply:

DT[, mapply(function(f,x) as.list(f(x)), aggregate_functions, .SD), id, 
    .SDcols=col_selection]

Need to careful on the ordering of col_selection and aggregate_functions so that the right function is applied to the right column.

output:

   id v1 v2
1:  a  2 15
2:  b  3 30
3:  c  4 45

Edit from the OP:

Just to complete this brilliant solution. This solution works very well and if we replace col_selection with names(aggregate_functions) there is no issue with the ordering. Plus it automatically discards all the columns that are not in the list:

library(data.table)
DT = data.table(id = rep(letters[1:3], each=3),
                v1 = rep(c(2, 3, 4), each=3),
                v2 = rep(c(5, 10, 15), each=3),
                id2 = c(rep(c('cc', 'dd'), 4), 'dd')
                )
aggregate_functions = list(v1=mean, v2=sum)
DT[, mapply(function(f,x) as.list(f(x)), aggregate_functions, .SD), id, 
   .SDcols=names(aggregate_functions)]
#>    id v1 v2
#> 1:  a  2 15
#> 2:  b  3 30
#> 3:  c  4 45

It is also possible to use multiple variables to aggregate by, by passing a list:

DT[, mapply(function(f,x) as.list(f(x)), aggregate_functions, .SD), list(id, id2), 
   .SDcols=names(aggregate_functions)]
#>    id id2 v1 v2
#> 1:  a  cc  2 10
#> 2:  a  dd  2  5
#> 3:  b  dd  3 20
#> 4:  b  cc  3 10
#> 5:  c  cc  4 15
#> 6:  c  dd  4 30

Created on 2019-11-27 by the reprex package (v0.3.0)

Sciolism answered 27/11, 2019 at 0:22 Comment(1)
This is great! I have added a few things to the original solution, you can remove them if you don't like it. Thank youAbstractionist
A
1

After I posted the question, a link to this answer by @Uwe appeared in the right box that holds the results I am looking for. I tweaked it to match my example:

library(magrittr)
library(data.table)
DT = data.table(id = rep(letters[1:3], each=3),
                v1 = rep(c(2, 3, 4), each=3),
                v2 = rep(c(5, 10, 15), each=3))
aggregate_functions = list(v1='mean', v2='sum')
col_selection = c('v1', 'v2')
aggregate_functions %>%
  names() %>% 
  lapply(
    function(col_selection) lapply(
      aggregate_functions[[col_selection]], 
      function(.fct) sprintf("%s = %s(%s)", col_selection, .fct, col_selection))) %>% 
  unlist() %>% 
  paste(collapse = ", ") %>% 
  sprintf("DT[, .(%s), by = id]", .) %>% 
  parse(text = .) %>% 
  eval()
#>    id v1 v2
#> 1:  a  2 15
#> 2:  b  3 30
#> 3:  c  4 45

I would still be interested in 'all in data.table' solutions.

Created on 2019-11-26 by the reprex package (v0.3.0)

Abstractionist answered 26/11, 2019 at 22:9 Comment(0)
S
1

An option is to use mapply:

DT[, mapply(function(f,x) as.list(f(x)), aggregate_functions, .SD), id, 
    .SDcols=col_selection]

Need to careful on the ordering of col_selection and aggregate_functions so that the right function is applied to the right column.

output:

   id v1 v2
1:  a  2 15
2:  b  3 30
3:  c  4 45

Edit from the OP:

Just to complete this brilliant solution. This solution works very well and if we replace col_selection with names(aggregate_functions) there is no issue with the ordering. Plus it automatically discards all the columns that are not in the list:

library(data.table)
DT = data.table(id = rep(letters[1:3], each=3),
                v1 = rep(c(2, 3, 4), each=3),
                v2 = rep(c(5, 10, 15), each=3),
                id2 = c(rep(c('cc', 'dd'), 4), 'dd')
                )
aggregate_functions = list(v1=mean, v2=sum)
DT[, mapply(function(f,x) as.list(f(x)), aggregate_functions, .SD), id, 
   .SDcols=names(aggregate_functions)]
#>    id v1 v2
#> 1:  a  2 15
#> 2:  b  3 30
#> 3:  c  4 45

It is also possible to use multiple variables to aggregate by, by passing a list:

DT[, mapply(function(f,x) as.list(f(x)), aggregate_functions, .SD), list(id, id2), 
   .SDcols=names(aggregate_functions)]
#>    id id2 v1 v2
#> 1:  a  cc  2 10
#> 2:  a  dd  2  5
#> 3:  b  dd  3 20
#> 4:  b  cc  3 10
#> 5:  c  cc  4 15
#> 6:  c  dd  4 30

Created on 2019-11-27 by the reprex package (v0.3.0)

Sciolism answered 27/11, 2019 at 0:22 Comment(1)
This is great! I have added a few things to the original solution, you can remove them if you don't like it. Thank youAbstractionist

© 2022 - 2024 — McMap. All rights reserved.