Assign multiple columns using := in data.table, by group
Asked Answered
A

2

175

What is the best way to assign to multiple columns using data.table? For example:

f <- function(x) {c("hi", "hello")}
x <- data.table(id = 1:10)

I would like to do something like this (of course this syntax is incorrect):

x[ , (col1, col2) := f(), by = "id"]

And to extend that, I may have many columns with names stored in a variable (say col_names) and I would like to do:

x[ , col_names := another_f(), by = "id", with = FALSE]

What is the correct way to do something like this?

Amoroso answered 27/7, 2012 at 2:13 Comment(3)
This looks like it has been answered: #11309254Amoroso
Alex, That answer is close but it doesn't seem to work in combination with by as @Christoph_J is correct to say. Link to your question added to FR#2120 "Drop needing with=FALSE for LHS of :=", so it won't get forgotten to revisit.Syncretism
To be clear, f() is a function returning multiple values, one for each of your columns.Ostler
S
196

This now works in v1.8.3 on R-Forge. Thanks for highlighting it!

x <- data.table(a = 1:3, b = 1:6) 
f <- function(x) {list("hi", "hello")} 
x[ , c("col1", "col2") := f(), by = a][]
#    a b col1  col2
# 1: 1 1   hi hello
# 2: 2 2   hi hello
# 3: 3 3   hi hello
# 4: 1 4   hi hello
# 5: 2 5   hi hello
# 6: 3 6   hi hello

x[ , c("mean", "sum") := list(mean(b), sum(b)), by = a][]
#    a b col1  col2 mean sum
# 1: 1 1   hi hello  2.5   5
# 2: 2 2   hi hello  3.5   7
# 3: 3 3   hi hello  4.5   9
# 4: 1 4   hi hello  2.5   5
# 5: 2 5   hi hello  3.5   7
# 6: 3 6   hi hello  4.5   9 

mynames = c("Name1", "Longer%")
x[ , (mynames) := list(mean(b) * 4, sum(b) * 3), by = a]
#     a b col1  col2 mean sum Name1 Longer%
# 1: 1 1   hi hello  2.5   5    10      15
# 2: 2 2   hi hello  3.5   7    14      21
# 3: 3 3   hi hello  4.5   9    18      27
# 4: 1 4   hi hello  2.5   5    10      15
# 5: 2 5   hi hello  3.5   7    14      21
# 6: 3 6   hi hello  4.5   9    18      27


x[ , get("mynames") := list(mean(b) * 4, sum(b) * 3), by = a][]  # same
#    a b col1  col2 mean sum Name1 Longer%
# 1: 1 1   hi hello  2.5   5    10      15
# 2: 2 2   hi hello  3.5   7    14      21
# 3: 3 3   hi hello  4.5   9    18      27
# 4: 1 4   hi hello  2.5   5    10      15
# 5: 2 5   hi hello  3.5   7    14      21
# 6: 3 6   hi hello  4.5   9    18      27

x[ , eval(mynames) := list(mean(b) * 4, sum(b) * 3), by = a][]   # same
#    a b col1  col2 mean sum Name1 Longer%
# 1: 1 1   hi hello  2.5   5    10      15
# 2: 2 2   hi hello  3.5   7    14      21
# 3: 3 3   hi hello  4.5   9    18      27
# 4: 1 4   hi hello  2.5   5    10      15
# 5: 2 5   hi hello  3.5   7    14      21
# 6: 3 6   hi hello  4.5   9    18      27

Older version using the with argument (we discourage this argument when possible):

x[ , mynames := list(mean(b) * 4, sum(b) * 3), by = a, with = FALSE][] # same
#    a b col1  col2 mean sum Name1 Longer%
# 1: 1 1   hi hello  2.5   5    10      15
# 2: 2 2   hi hello  3.5   7    14      21
# 3: 3 3   hi hello  4.5   9    18      27
# 4: 1 4   hi hello  2.5   5    10      15
# 5: 2 5   hi hello  3.5   7    14      21
# 6: 3 6   hi hello  4.5   9    18      27
Syncretism answered 6/10, 2012 at 8:48 Comment(15)
Thanks for this answer and the examples. How should I modify the following line in order to get two columns for each objectName from the dim output, rather than one column with two rows? data.table(objectName=ls())[,c("rows","cols"):=dim(get(objectName)),by=objectName] (I'm using data.table 1.8.11)Enrol
@Enrol dim returns a vector so converting that to type list should rotate it; e.g. [,c("rows","cols"):=as.list(dim(get(objectName))),by=objectNa‌​me]. Trouble is that as.list has call overhead and also copies the small vector. If efficiency is a problem as number of groups rises then please let us know.Syncretism
Thanks @Matt_Dowle. I had tried list but not as.list. Speed isn't an issue. Just wanted a quick way to find objects in the environment that had a certain number of columns or rows. This is off topic, but... what do you think about adding NCOL to tables()?Enrol
@Enrol Great idea, done. Just committed to v1.9.3 on R-Forge.Syncretism
Hi Matt. The first example in your second code block (i.e. x[,mynames:=list(mean(b)*4,sum(b)*3),by=a,with=FALSE][]) now throws a warning, so maybe remove it? On a related note, has anyone suggested that, with options(datatable.WhenJisSymbolThenCallingScope=TRUE), an assignment like x[,mynames:=list(mean(b)*4,sum(b)*3),by=a] should in fact work? Seems like that would be consistent with the other changes, though I guess it might break too much existing user code (?).Plastic
@JoshO'Brien Matt removed that option in response to a bug shortly after your comment, fyi github.com/Rdatatable/data.table/issues/1952Reichel
@MattDowle can you make it such that this syntax would work too? x[, .(Col1, Col2) := 'somevalue'] This is particularly useful because you can then use RStudio's autocomplete for column names as you type them.Cancan
@Cancan Seems like a good idea, should be possible. Please file a feature request on GitHub here - thanks.Syncretism
I don't think by=a is necessary; not sure why it would be. I tried and it worked.Owades
@PanFrancisco Without by=a it will work, but return a different answer. The mean(a) and sum(a) aggregates are being recycled within each group when by=a. Without by=a it just sticks the mean and sum for the entire column into each cell (i.e. different numbers).Syncretism
@MattDowle what if my function already returns named list, is there anyway I can add the columns to the dt without having to name them again? e.g. f <- function(x) {list("c"="hi", "d"="hello")} will print results with named cols with x[ , f(), by = a][] . I don't know how to append the result to the dt.Infatuate
@Jfly That would be a good new question which would likely lead to a feature request filed on GitHub. Perhaps something like x[, {ans=f(); names(ans):=ans}, by=a] could be implemented. That syntax conveys the intent quite nicely to my eye. What you think?Syncretism
Why do you use mynames with and without parenthesis? Is there a difference?Simeon
@HermanToothrot mynames:= assigns to the column called "mynames". (mynames):= assigns to the column name held in the mynames variable in calling scope.Syncretism
@FengJiang data.table #1543 is tracking this issue and links to multiple questions here.Conley
B
82

The following shorthand notation might be useful. All credit goes to Andrew Brooks, specifically this article.

dt[,`:=`(avg=mean(mpg), med=median(mpg), min=min(mpg)), by=cyl]
Burglarious answered 1/4, 2018 at 11:37 Comment(1)
This is so much better and more readable than the c() := list()Cloudberry

© 2022 - 2024 — McMap. All rights reserved.