Speed up this loop to create dummy columns with data.table and set in R [duplicate]

I have a data.table and I want to create a new column for each unique day, then assign a 1 in each row where the day matches the column name (and a 0 otherwise).

I have done this using a for loop, but is there a way to optimise it using data.table and set()?

Here is an example:

dt <- data.table(Week_Day = c("Monday", "Tuesday", "Wednesday",
                          "Thursday", "Friday", "Saturday", "Sunday"))

Day <- unique(dt$Week_Day)
for (i in 1:length(Day)) {
    if (Day[i] != "Sunday") {
        dt[, Day[i] := ifelse(Week_Day == Day[i], 1, 0)]
    }
}

My table has 298k rows, and although the loop doesn't take long to execute (timing below), it's part of a long script with quite a few inefficient loops, so I am trying to get the overall run time down.

Time to run:

user  system elapsed
0.99    0.06    1.05

Thanks in advance.

Doubletime answered 22/7, 2016 at 12:9 Comment(2)
Do look at model.matrix() as well. Depending on how you use the result, that may be what you want. - Ridenhour
Related question: https://mcmap.net/q/600258/-creating-dummy-variables-in-r-data-table - Elmiraelmo
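For anyone curious about the model.matrix() suggestion in the comments, here is a minimal base R sketch. The names days, df and mm are made up for illustration, and this has not been benchmarked against the data.table answers below.

```r
# Sketch only: model.matrix() builds indicator columns from a factor.
# The "- 1" drops the intercept so every level gets its own 0/1 column.
days <- c("Monday", "Tuesday", "Wednesday",
          "Thursday", "Friday", "Saturday", "Sunday")
df <- data.frame(Week_Day = factor(days, levels = days))

mm <- model.matrix(~ Week_Day - 1, data = df)

# Column names come out as "Week_DayMonday" etc.; strip the prefix
colnames(mm) <- sub("^Week_Day", "", colnames(mm))

# Drop Sunday to mirror the loop in the question
mm <- mm[, colnames(mm) != "Sunday"]
```

The result is a numeric matrix rather than new columns added by reference, so it would still need to be bound back onto the table if that is what you want.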

Here's a different approach that performs better, at least on my machine, than the original approach in the question:

1) Get the unique days except Sunday:

Day <- setdiff(dt$Week_Day, "Sunday")

2) Initialize new columns with 0:

dt[, (Day) := 0L]

3) Update with 1s by reference in a loop:

for(x in Day) {
  set(dt, i = which(dt[["Week_Day"]] == x), j = x, value = 1L)
}

Simple performance comparison:

dt1 <- data.table(Week_Day = sample(c("Monday", "Tuesday", "Wednesday",
                              "Thursday", "Friday", "Saturday", "Sunday"), 3e5, TRUE))

dt2 <- copy(dt1)


# set() approach: initialise the columns with 0L, then fill in the 1s by reference
system.time({
  Day <- setdiff(unique(dt1$Week_Day), "Sunday")
  dt1[, (Day) := 0L]
  for(x in Day) {
    set(dt1, i = which(dt1[["Week_Day"]] == x), j = x, value = 1L)
  }
})
#       user      system     elapsed 
#      0.029       0.003       0.032 

# ifelse() loop from the question, using 1L/0L so both results are integer
system.time({
  Day <- unique(dt2$Week_Day)
  for (i in 1:length(Day)) {
    if (Day[i] != "Sunday") {
      dt2[, Day[i] := ifelse(Week_Day == Day[i], 1L, 0L)]
    }
  }
})

#       user      system     elapsed 
#      0.138       0.070       0.210 


all.equal(dt1, dt2)
#[1] TRUE
Beaton answered 22/7, 2016 at 12:35 Comment(4)
Amazing!! On my data of 300k rows the execution time was: user 0.02, system 0.00, elapsed 0.01. This leads me on to posting another question about a nested loop that is taking 'too' long. - Doubletime
A further speed-up is probably possible by using an index on Week_Day, since we subset on it multiple times. - Gallman
@docendo I've used your solution to add seasonal variables too, but by mistake I omitted the brackets around 'Day' when adding the new variables with dt1[, (Day) := 0L]. Doing this meant a 'Day' variable remained and I had NAs where I would have expected zeros. Do you know what adding the brackets does? Just so I can understand exactly what is happening. Thanks - Doubletime
@Doubletime By adding brackets, the expression inside is evaluated and its value is used for the column names. Otherwise, the column is just called 'Day'. You can have a look at the vignettes for data.table on GitHub for more information. - Beaton
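To make the point about the brackets concrete, here is a small sketch. The tables a and b are made up for illustration and are not part of the answer above.

```r
library(data.table)

Day <- c("Monday", "Tuesday")

# Without parentheses: the symbol on the left of := is taken literally,
# so a single column named "Day" is created.
a <- data.table(Week_Day = c("Monday", "Tuesday", "Sunday"))
a[, Day := 0L]
names(a)   # "Week_Day" "Day"

# With parentheses: Day is evaluated and its values become the column names.
b <- data.table(Week_Day = c("Monday", "Tuesday", "Sunday"))
b[, (Day) := 0L]
names(b)   # "Week_Day" "Monday" "Tuesday"
```

This would probably also explain the NAs mentioned in the comment: if the day columns were never initialised with 0L, the later set() calls only fill in the matching rows and leave the rest missing.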

Here is one attempt at a speed-up:

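# days that get a dummy column (Sunday is left out, as in the question)
Day <- setdiff(unique(dt$Week_Day), "Sunday")
setkey(dt, Week_Day)

# create columns of 0s
dt[, (Day) := 0L]

# keyed subset on Week_Day, then assign 1L by reference in the matching rows
for (d in Day) {
     dt[.(d), (d) := 1L]
}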

This uses a couple of data.table speed-ups: a keyed (binary search) subset on Week_Day inside the loop, and replacing ifelse() with assignment by reference.

Celestyna answered 22/7, 2016 at 12:33 Comment(3)
Thanks, @arun. I didn't know that it was possible to "vectorize" the creation of variables with := in such a manner. Really cool technique. - Celestyna
Sure, the RHS is just recycled first to the length of the LHS, and then to the number of rows (for convenience). - Bagpipes
@Bagpipes thanks! This was also blazingly fast on my machine: user 0.03, system 0.00, elapsed 0.03. I now know my script must be very inefficient because I have quite a few nested loops. I shall be posting another question at some point. Thanks for your help :) - Doubletime
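The recycling described in the comments can be seen on a toy table. This is a sketch only; new_cols is a made-up name.

```r
library(data.table)

dt <- data.table(Week_Day = c("Monday", "Tuesday", "Sunday"))
new_cols <- c("Monday", "Tuesday")

# A single 0L on the right-hand side is reused for both new columns,
# and within each column for every row.
dt[, (new_cols) := 0L]
```

After this call, dt has two extra integer columns, Monday and Tuesday, each filled with zeros.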
