Efficient multiplication of columns in a data frame
I have a large data frame in which I am multiplying two columns together to get another column. At first I was running a for-loop, like so:

for(i in 1:nrow(df)){
    df$new_column[i] <- df$column1[i] * df$column2[i]
}

but this takes like 9 days.

Another alternative was plyr, and I actually might be using the variables incorrectly:

library(plyr)
new_df <- ddply(df, .(column1, column2), transform, new_column = column1 * column2)

but this is taking forever as well.

Autum answered 10/9, 2012 at 18:40 Comment(7)
What's wrong with df$new_column <- df$column1 * df$column2? How big is your data frame? – Passage
400000 rows should be quick even with plyr... but if you're just trying to multiply two columns, @BlueMagister has the best solution. Even still, a for loop version shouldn't take 9 days... – Madonia
Well, the 9 days was a slight exaggeration; forgot to mention I was joking there. – Autum
Most operations in R are vectorized, so you can multiply vectors by vectors and it will multiply entries of the same index together. The problem with the for-loop is that R creates a new data frame for every iteration of the loop. The solution I suggested creates just one new data frame instead of 400K new data frames. – Passage
Yes, it is beautiful. This is almost as good as sliced bread. – Autum
"time-efficient" and "plyr" are not generally used in the same sentence. If speed is your goal, you should look at the data.table package. (...although 400K x 7 is a tiny dataset these days, and the ordinary functions offered in the answers below should suffice.) – Lucilius
The reason plyr is so slow here is that you are unnecessarily grouping over column1 and column2. This creates groups for every unique combination of these columns. ddply is not required; data.table will create the new column efficiently. – Aggregation
As Blue Magister said in comments,

df$new_column <- df$column1 * df$column2

should work just fine. Of course, we can never know for sure without an example of the data.
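For scale, here is a minimal self-contained sketch (toy data standing in for the 400K-row data frame in the question; timings will vary by machine) comparing the original loop against the vectorized form:

```r
# Toy data with the column names from the question
set.seed(1)
df <- data.frame(column1 = runif(1e5), column2 = runif(1e5))

# Row-by-row loop, preallocated into a plain vector (already faster
# than growing df$new_column inside the loop as in the question)
new_loop <- numeric(nrow(df))
for (i in seq_len(nrow(df))) {
  new_loop[i] <- df$column1[i] * df$column2[i]
}

# Vectorized: one multiply over the whole columns
df$new_column <- df$column1 * df$column2

all.equal(new_loop, df$new_column)  # identical results
```

The vectorized version does a single operation over entire columns at C speed, which is why it finishes in milliseconds where the per-row loop drags.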

Swagman answered 10/9, 2012 at 18:44 Comment(5)
Even more beautiful, but essentially the same: df$new_column <- with(df, column1 * column2) – Lucilius
@DWin it seems a little odd to use with() but then assign via $<-. Anything wrong with within() or transform()? – Reptant
If you are assigning to just a single (new or otherwise) column, I think you need to use with instead of within, because within will return the entire data.frame. That's my understanding, anyway. (...and testing confirms that dat$new <- within(dat, old*2) will make a messy nested data frame copy within the 'dat' object.) – Lucilius
Yes, it should work fine for the 20MB dataset (400k x 7). But it will copy the whole 20MB, so don't repeat it too much. For example, adding 10 columns using $<- each time needs 400MB (400e3*sum(7:17)*8/1024^2). Adding 50 churns through 5GB (400e3*sum(7:57)*8/1024^2) to get to a 400k x 57 result (just 173MB). Just to be aware. – Parterre
So, add columns in bulk if you can, not one by one, since that can bite even at the 20MB size. Just in case that was going to be your next step. Or use :=, which doesn't copy the whole 20MB. – Parterre
A data.table solution will avoid lots of internal copying while having the advantage of not spattering the code with $.

library(data.table)
DT <- data.table(df)
DT[, new := column1 * column2]
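A self-contained sketch of the same idea (toy data; column names as in the question). Note that setDT(), added to data.table after this answer was written, converts a data frame in place and so avoids the copy that data.table(df) makes:

```r
library(data.table)

df <- data.frame(column1 = 1:5, column2 = 6:10)
setDT(df)                       # convert in place; no copy of df
df[, new := column1 * column2]  # add the column by reference
df$new
#> [1]  6 14 24 36 50
```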
Aggregation answered 10/9, 2012 at 20:7 Comment(0)
A minor, somewhat less efficient, version of Sacha's answer is to use transform() or within():

df <- transform(df, new = column1 * column2)

or

df <- within(df, new <- column1 * column2)

(I hate spattering my user code with $.)
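A quick self-contained check (toy data, column names as in the question) that the two forms give the same new column:

```r
df <- data.frame(column1 = 1:4, column2 = 5:8)

df_t <- transform(df, new = column1 * column2)
df_w <- within(df, new <- column1 * column2)

identical(df_t$new, df_w$new)  # both forms agree
df_t$new
#> [1]  5 12 21 32
```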

Reptant answered 10/9, 2012 at 18:49 Comment(10)
Why would this be less efficient? Is $<- optimized not to copy the data frame? – Covariance
There have been some recent optimisations of certain ops on data frames, but I don't recall if that is one of them. I assume this will be slightly less efficient because transform() and within() contain quite a few additional lines of R code on top of the one that evaluates the expression. – Reptant
@GaborCsardi Unfortunately, $<- is not optimized not to copy in R; it copies the entire df. As do transform and within; they copy the entire df too. This is why data.table introduced :=, to allow assignment by reference, as demo'd in mnel's answer. – Parterre
@GavinSimpson All solutions other than := copy the entire df, and therefore don't scale. The extra lines of code in transform and within are tiny in comparison. Unless the call is looped, such as in transform-by-group, but then it's even more dominated by all the copies. If transform didn't copy, then the number of lines would come into it. – Parterre
@MatthewDowle This is hardly a huge problem, and just using base R solutions properly will save the OP vast amounts of time. I agree moving to data.table for these things will pay off handsomely, but it is an extra set of functions & syntax to master. I don't need to be sold on the data.table goodness; I already sing from that hymn book. I just need to spend some time integrating it into my workflow a bit more so the syntax sticks. – Reptant
@GavinSimpson Sure, I was just answering the open points in your and Gabor's comments above. And I also had in mind other people reading this question who might have larger datasets or want to add more than one column. – Parterre
@MatthewDowle: well, theoretically $<-.data.table could be optimized not to copy the whole data frame in this case. All I was asking is whether this was done or not. data.table has a lot of nice features indeed. If only the syntax were more R-like... – Covariance
@GaborCsardi Oh, $<-.data.table? No, it's not possible to avoid that copy (as a mere package author) without a change to R itself. R itself makes the copy when dispatching the $<- method, iiuc. Whether R can be changed is the burning question. In the meantime I like data.table syntax, obviously. It's more like SQL than R. Some have said its syntax is trivial, which is nice to hear. – Parterre
@MatthewDowle: sorry, typo, it is late. I was asking about <-.data.frame, obviously. I don't know the R internals much, but I don't think copies are made unless the object is modified, so just dispatching should not make a copy. Re the data.table syntax, I am not saying it is difficult, just that it is different. – Covariance
But to be honest, these methods spatter brackets instead of dollar signs ;-) – Draught
You can simply create a function to handle all sorts of multiplications like this one:

GetMultiplication <- function(x, y) {
  x * y
}

# for example:
xCol <- c(1, 2, 3, 4, 5)
yCol <- c(10, 20, 30, 40, 50)
const <- 0.055

# Case 1: Column 1 * Column 2
ZCol_1 <- GetMultiplication(xCol, yCol)
print(ZCol_1)
#> [1]  10  40  90 160 250

# Case 2: Column 1 * (Column 1 * 10 + 1000)
ZCol_2 <- GetMultiplication(xCol, xCol * 10 + 1000)
print(ZCol_2)
#> [1] 1010 2040 3090 4160 5250

# Case 3: Column 1 * a constant value
ZCol_3 <- GetMultiplication(xCol, const)
print(ZCol_3)
#> [1] 0.055 0.110 0.165 0.220 0.275
Denson answered 3/10, 2020 at 22:25 Comment(0)
This works with 2 or more numeric columns in a data frame (note that every column of df must be numeric, since prod is applied across each full row):

df$product <- apply(df, 1, prod)
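For example, with a small all-numeric data frame (hypothetical column names):

```r
df <- data.frame(a = 1:3, b = 4:6, c = 7:9)
df$product <- apply(df, 1, prod)  # row-wise product across all columns
df$product
#> [1]  28  80 162
```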
Body answered 23/3, 2022 at 21:58 Comment(1)
Very nice solution for situations where the number of columns in the data frame isn't always known. – Unpeopled

© 2022 - 2024 — McMap. All rights reserved.