Replace a sequence of values by group depending on preceeding values
Asked Answered
I

4

10

I have a data table of this form (2000000+ rows, 1000+groups):

set.seed(1)    
dt <- data.table(id = rep(1:3, each = 5), values = sample(c("a", "b","c"), 15, TRUE))

> dt
    id values
 1:  1      a
 2:  1      c
 3:  1      a
 4:  1      b
 5:  1      a
 6:  2      c
 7:  2      c
 8:  2      b
 9:  2      b
10:  2      c
11:  3      c
12:  3      a
13:  3      a
14:  3      a
15:  3      b

I want to, within each ID group, replace the entire sequence of character "a", that precedes the character "b", and I want to replace them with "b". So the condition is that if "a" or a sequence of "a"s appear before "b", replace all the "a"s. (actually, in my real table, it's when "b" is preceded by "a","x", or"y", preceding character should be replaced, but I should be able to generalize)

In the example above,the value of "a" in row 3 should be replaced (easy to do with (shift) in data.table), as well as all the "a"s in rows 12-14 (not sure how to do). So, the desired output is this:

> dt
    id values
 1:  1      a
 2:  1      c
 3:  1      b
 4:  1      b
 5:  1      a
 6:  2      c
 7:  2      c
 8:  2      b
 9:  2      b
10:  2      c
11:  3      c
12:  3      b
13:  3      b
14:  3      b
15:  3      b

What comes to my mind is looping from the last index, but I am not exactly sure how to do that with if I have multiple groupings (say, ID and DATE), and anyway, this doesn't seem to be the fastest dt solution.

Isadoraisadore answered 9/6, 2020 at 16:12 Comment(0)
B
1

You can use rle().

Note: To avoid ambiguity, I rename the "values" column to "var" because the rle() function also produces a list containing a vector named "values".

dt[, new := with(rle(var), rep(ifelse(values == "a" & c(values[-1], "") == "b", "b", values), lengths)), by = id]
dt

#     id var new
#  1:  1   a   a
#  2:  1   c   c
#  3:  1   a   b
#  4:  1   b   b
#  5:  1   a   a
#  6:  2   c   c
#  7:  2   c   c
#  8:  2   b   b
#  9:  2   b   b
# 10:  2   c   c
# 11:  3   c   c
# 12:  3   a   b
# 13:  3   a   b
# 14:  3   a   b
# 15:  3   b   b
Bughouse answered 9/6, 2020 at 16:44 Comment(6)
Thanks! Looks good, could you maybe help me understand how is lengths element from rle() used here, been trying to understand it for some time...Isadoraisadore
@Isadoraisadore For example, you can run x <- c("a", "a", "b", "b", "b") ; rle(x). It returns a list of 2 vectors, one is values and the other is lengths. lengths means the counts of elements in values. If you run with(rle(x), rep(values, lengths)), the output will be the same as the original x because elements in values are repeated lengths times.Bughouse
Thanks, a very neat solution! For anyone reading this, dt[, rle(var), by = id] basically creates a table showing sequences of strings without repetitions, with lengths column showing how many times each string is repeated. This is much simpler case, and everything else follows from thereIsadoraisadore
This is really neat as it follows the stated condition exactly and I am 100% sure that it should work as expected in a more complex table I am dealing withIsadoraisadore
I'll actually use dt$new <- dt[, rle(var), by = "id"][, rep(ifelse(values == "a" & shift(values, type= "lead") == "b", "b", values), lengths)] as I think that that sequence of commands is easiest to understand (important if you revisit your code after some time and forget how you came up with the solution :))Isadoraisadore
@Isadoraisadore good work. If you use shift(), remember to add fill = "". Otherwise, missing values will appear if the final element of dt$var is "a". You can set dt$var[15] <- "a" and run your code again.Bughouse
R
5

Here's another data.table approach:

dt[, x := rleid(values), by = .(id)]
dt[dt[values == "b", .(id, x=x-1, values="a")], 
   on = .(id, x, values), 
   values := "b"
   ][, x := NULL]
  • create a new column "x" with the run length ids per value grouped by id
  • join on itself while modifying the run length ids (x) to be the preceeding value and values to be "a" (the specific value you want to change), then update values with "b"
  • delete column x afterwards

The result is:

dt
#     id values
#  1:  1      a
#  2:  1      c
#  3:  1      b
#  4:  1      b
#  5:  1      a
#  6:  2      c
#  7:  2      c
#  8:  2      b
#  9:  2      b
# 10:  2      c
# 11:  3      c
# 12:  3      b
# 13:  3      b
# 14:  3      b
# 15:  3      b

And here's a generalization to the case where you want to replace values "a", "x", or "y" followed by "b" with "b":

dt[, x := rleid(values), by = .(id)]
dt[dt[values == "b", .(values=c("a", "x", "y")), by = .(id, x=x-1)], 
   on = .(id, x, values), 
   values := "b"
   ][, x := NULL]
Rosio answered 9/6, 2020 at 20:5 Comment(0)
L
4

Late to the party and several nice run length alternatives were already provided ;) So here I try nafill instead.

(1) Create a variable 'v2' which is NA when 'values' are "a". (2) Fill missing values by next observation carried backward. (3) When the original 'values' are "a" and the corresponding filled values in 'v2' are "b", update 'v' with 'v2'.

# 1
dt[values != "a" , v2 := values]

# 2
d1[, v2 := v2[nafill(replace(seq_len(.N), is.na(v2), NA), type = "nocb")], by = id]

# 3
dt[values == "a" & v2 == "b", values := v2]

# clean-up
dt[ , v2 := NULL]

Currently, nafill only works with numeric variables, hence replace step in chunk # 2 (modified from @chinsoon12 in the issue nafill, setnafill for character, factor and other types).

The NA replacement code may be slightly shortened by using zoo::nalocf:

dt[, v2 := zoo::na.locf(v2, fromLast = TRUE, na.rm = FALSE), by = id]

However, note that na.locf is slower.


When comparing the answers on larger data (data.table(id = rep(1:1e4, each = 1e4, replace = TRUE), values = sample(c("a", "b", "c"), 1e8, replace = TRUE)), it turns out that this alternative actually is faster than the others.

Longhair answered 10/6, 2020 at 11:38 Comment(3)
Interesting solution, thanks! I really like it as rle might not be obvious to everyone reading your code. It can be made shorter by using zoo:na.locf (ri indices column in your example is only there because nafill doesn't work with characters I think). See: dt[values != "a" , v2 := values] ; dt[, v2 := zoo::na.locf(v2, fromLast = TRUE, na.rm = FALSE), by = "id"] ; dt[values == "a" & v2 == "b", values := v2]Isadoraisadore
Thanks a lot for your feedback @Djpengo. I tried na.locf before posting my answer, but it was considerably slower so I decided to exclude it (sorry, I should have mentioned that in the post). I might put it back.Longhair
I guess it's good if it's in there, as it depends what you are looking for, simplicity or speed. With DT it makes sense that one is indeed looking for the speed thoughIsadoraisadore
F
2

This is not pretty but I think this is what you are after:

dt[, .N, by = .(id, values = paste0(values, rleid(values)))
   ][, values := sub("[0-9]+", "", values)
     ][, values := fifelse(values == "a" & shift(values, -1L) == "b" & !is.na(shift(values, -1L)), "b", values), by = id
       ][, .SD[rep(seq_len(.N), N)]
         ][, !"N"]

    id values
 1:  1      a
 2:  1      c
 3:  1      b
 4:  1      b
 5:  1      a
 6:  2      c
 7:  2      c
 8:  2      b
 9:  2      b
10:  2      c
11:  3      c
12:  3      b
13:  3      b
14:  3      b
15:  3      b
Frasquito answered 9/6, 2020 at 16:44 Comment(0)
B
1

You can use rle().

Note: To avoid ambiguity, I rename the "values" column to "var" because the rle() function also produces a list containing a vector named "values".

dt[, new := with(rle(var), rep(ifelse(values == "a" & c(values[-1], "") == "b", "b", values), lengths)), by = id]
dt

#     id var new
#  1:  1   a   a
#  2:  1   c   c
#  3:  1   a   b
#  4:  1   b   b
#  5:  1   a   a
#  6:  2   c   c
#  7:  2   c   c
#  8:  2   b   b
#  9:  2   b   b
# 10:  2   c   c
# 11:  3   c   c
# 12:  3   a   b
# 13:  3   a   b
# 14:  3   a   b
# 15:  3   b   b
Bughouse answered 9/6, 2020 at 16:44 Comment(6)
Thanks! Looks good, could you maybe help me understand how is lengths element from rle() used here, been trying to understand it for some time...Isadoraisadore
@Isadoraisadore For example, you can run x <- c("a", "a", "b", "b", "b") ; rle(x). It returns a list of 2 vectors, one is values and the other is lengths. lengths means the counts of elements in values. If you run with(rle(x), rep(values, lengths)), the output will be the same as the original x because elements in values are repeated lengths times.Bughouse
Thanks, a very neat solution! For anyone reading this, dt[, rle(var), by = id] basically creates a table showing sequences of strings without repetitions, with lengths column showing how many times each string is repeated. This is much simpler case, and everything else follows from thereIsadoraisadore
This is really neat as it follows the stated condition exactly and I am 100% sure that it should work as expected in a more complex table I am dealing withIsadoraisadore
I'll actually use dt$new <- dt[, rle(var), by = "id"][, rep(ifelse(values == "a" & shift(values, type= "lead") == "b", "b", values), lengths)] as I think that that sequence of commands is easiest to understand (important if you revisit your code after some time and forget how you came up with the solution :))Isadoraisadore
@Isadoraisadore good work. If you use shift(), remember to add fill = "". Otherwise, missing values will appear if the final element of dt$var is "a". You can set dt$var[15] <- "a" and run your code again.Bughouse

© 2022 - 2024 — McMap. All rights reserved.