R data.table use variable name for assignment in group by [duplicate]
Asked Answered
V

2

5

I am trying to summarize a data.table using a character variable as the name for the new column along with by.

library(data.table)
dt <- data.table(g = rep(1:3, 4), xa = runif(12), xb = runif(12))
# desired output
dt[, .(sa = mean(xa)), by = g]
   g       sa
1: 1 1.902360
2: 2 2.149041
3: 3 2.586044

The issue is that the following code returns the entire data.table still, without reducing to just the unique values of g.

cn <- paste0('s', 'a')
# returns all rows
dt[, (cn) := mean(xa), by = g][]
g        xa         xb        sa
1: 1 0.3423699 0.81447505 0.4755900
 2: 2 0.0932055 0.06853225 0.5372602
 3: 3 0.2486223 0.13286546 0.6465111
 4: 1 0.6942175 0.66405944 0.4755900
 5: 2 0.7225208 0.83110248 0.5372602
 6: 3 0.9898293 0.09520907 0.6465111
 7: 1 0.3523753 0.72743182 0.4755900
 8: 2 0.5504942 0.01966303 0.5372602
 9: 3 0.3523625 0.55257436 0.6465111
10: 1 0.5133974 0.39650089 0.4755900
11: 2 0.7828203 0.89909528 0.5372602
12: 3 0.9952302 0.16872205 0.6465111

How do I get the usual summarized data.table? (This is a simplified example. In my actual problem, there will be multiple names passed to a loop)

Vidicon answered 16/9, 2020 at 18:31 Comment(0)
T
3

Either use setNames wrapped around the list (.(mean(xa))) column or

dt[, setNames(.(mean(xa)), cn), by = g]
#  g        sa
#1: 1 0.2010599
#2: 2 0.4710056
#3: 3 0.4871248

or the setnames after getting the summarised output

setnames(dt[, mean(xa), by = g], 'V1', cn)[]

In data.table, := operator is used for creating/modifying a column in the original dataset. But, this operator is different when used in the tidyverse context

library(dplyr)
dt %>%
    group_by(g) %>% 
    summarise(!! cn := mean(xa), .groups = 'drop')
# A tibble: 3 x 2
#      g    sa
#  <int> <dbl>
#1     1 0.201
#2     2 0.471
#3     3 0.487
Trihedral answered 16/9, 2020 at 18:32 Comment(4)
That works! Seems like strange syntax compared to what I was expectingVidicon
@drj3122 as the same operator symbol is used in tidyverse and data.table, there can be confusion about the behaviorTrihedral
I should have added I originally tried dt[, .((cn) = mean(xa)), by = g], which I thought is consistent with data.table syntax, but that returns Error: unexpected '=' in "dt[, .((cn) ="Vidicon
@drj3122 names on the lhs of = doesn't get evaluated correctly. If you use cn = mean(xa), then the column name would be the literal 'cn'Trihedral
M
8

There is a pending PR which will make this kind of operations much easier, data.table#4304. Once implemented in current design the query will looks like:

dt[, .(cn = mean(xa)), by = g, env = list(cn="sa")]
#       g        sa
#   <int>     <num>
#1:     1 0.2060352
#2:     2 0.1707827
#3:     3 0.6850591

installation of PR branch

remotes::install_github("Rdatatable/data.table@programming")

data

library(data.table)
dt <- data.table(g = rep(1:3, 4), xa = runif(12), xb = runif(12))
Mccourt answered 17/9, 2020 at 7:29 Comment(1)
That looks so much more intuitive!Vidicon
T
3

Either use setNames wrapped around the list (.(mean(xa))) column or

dt[, setNames(.(mean(xa)), cn), by = g]
#  g        sa
#1: 1 0.2010599
#2: 2 0.4710056
#3: 3 0.4871248

or the setnames after getting the summarised output

setnames(dt[, mean(xa), by = g], 'V1', cn)[]

In data.table, := operator is used for creating/modifying a column in the original dataset. But, this operator is different when used in the tidyverse context

library(dplyr)
dt %>%
    group_by(g) %>% 
    summarise(!! cn := mean(xa), .groups = 'drop')
# A tibble: 3 x 2
#      g    sa
#  <int> <dbl>
#1     1 0.201
#2     2 0.471
#3     3 0.487
Trihedral answered 16/9, 2020 at 18:32 Comment(4)
That works! Seems like strange syntax compared to what I was expectingVidicon
@drj3122 as the same operator symbol is used in tidyverse and data.table, there can be confusion about the behaviorTrihedral
I should have added I originally tried dt[, .((cn) = mean(xa)), by = g], which I thought is consistent with data.table syntax, but that returns Error: unexpected '=' in "dt[, .((cn) ="Vidicon
@drj3122 names on the lhs of = doesn't get evaluated correctly. If you use cn = mean(xa), then the column name would be the literal 'cn'Trihedral

© 2022 - 2024 — McMap. All rights reserved.