Fill NAs in R with zero if the next valid data point is more than 2 intervals away
I have multiple vectors with NAs, and my intention is to fill any NA that is more than 2 intervals from a valid data point with 0. For example:

x <- c(3, 4, NA, NA, NA, 3, 3)

The expected output is:

3, 4, NA, 0, NA, 3, 3 
Phytobiology answered 20/6, 2019 at 18:49 Comment(1)
Maybe you should change the accepted answer to another one.Cnidus

Update -

Here's probably one of the simplest and fastest solutions (thanks to the answer from G. Grothendieck). Simply knowing whether the values on either side of an NA are themselves NA is sufficient information. Therefore, using lead and lag from the dplyr package -

library(dplyr)

na2zero <- function(x) {
  # an NA becomes 0 only when both of its neighbours are also NA
  x[is.na(lag(x, 1, 0)) & is.na(lead(x, 1, 0)) & is.na(x)] <- 0
  x
}

na2zero(x = c(3, 4, NA, NA, NA, 3, 3))
[1]  3  4 NA  0 NA  3  3

na2zero(x = c(3, 4, NA, NA, NA, NA, NA, 3, 3))
[1]  3  4 NA  0  0  0 NA  3  3

na2zero(x = c(3, 4, NA, NA, NA, 3, 3, NA, NA, 1, NA, 0, 0, rep(NA, 4L)))
[1]  3  4 NA  0 NA  3  3 NA NA  1 NA  0  0 NA  0  0 NA

Previous Answer (also fast) -

Here's one way using rle and replace from base R. This method turns every NA that is not an endpoint of a run of NAs into a 0 -

na2zero <- function(x) {
  run_lengths <- rle(is.na(x))$lengths
  replace(x, 
    sequence(run_lengths) != 1 &
    sequence(run_lengths) != rep(run_lengths, run_lengths) &
    is.na(x),
  0)
}

na2zero(x = c(3, 4, NA, NA, NA, 3, 3))
[1]  3  4 NA  0 NA  3  3

na2zero(x = c(3, 4, NA, NA, NA, NA, NA, 3, 3))
[1]  3  4 NA  0  0  0 NA  3  3
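To see how the indexing picks out only the interior NAs, here are the intermediate pieces for the question's example vector (values worked out by hand, not pasted from a console):

x <- c(3, 4, NA, NA, NA, 3, 3)
run_lengths <- rle(is.na(x))$lengths  # 2 3 2: a non-NA run, an NA run, a non-NA run
sequence(run_lengths)                 # 1 2 1 2 3 1 2 - position of each element within its run
rep(run_lengths, run_lengths)         # 2 2 3 3 3 2 2 - length of the run each element belongs to
# Only position 4 is NA, not first in its run, and not last in its run, so only it becomes 0.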

Updated Benchmarks -

library(microbenchmark)

set.seed(2)
x <- c(3, 4, NA, NA, NA, 3, 3)
x <- sample(x, 1e5, TRUE)

microbenchmark(
  Rui(x),
  Shree_rle(x), Shree_dplyr(x),
  markus(x),
  IceCreamT(x),
  Uwe1(x), Uwe2(x), Uwe_Reduce(x),
  Grothendieck(x),
  times = 50
)

all.equal(Shree_dplyr(x), Rui(x)) # [1] TRUE
all.equal(Shree_dplyr(x), Shree_rle(x)) # [1] TRUE
all.equal(Shree_dplyr(x), markus(x)) # [1] TRUE
all.equal(Shree_dplyr(x), Uwe1(x)) # [1] TRUE
all.equal(Shree_dplyr(x), Uwe2(x)) # [1] TRUE
all.equal(Shree_dplyr(x), Uwe_Reduce(x)) # [1] TRUE
all.equal(Shree_dplyr(x), Grothendieck(x)) # [1] TRUE


Unit: milliseconds
           expr        min         lq        mean     median          uq        max neval
         Rui(x) 286.026540 307.586604  342.620266 318.404731  363.844258  518.03330    50
   Shree_rle(x)  51.556489  62.038875   85.348031  65.012384   81.882141  327.57514    50
 Shree_dplyr(x)   3.996918   4.258248   17.210709   6.298946   10.335142  207.14732    50
      markus(x) 853.513854 885.419719 1001.450726 919.930389 1018.353847 1642.25435    50
   IceCreamT(x)  12.162079  13.773873   22.555446  15.021700   21.271498  199.08993    50
        Uwe1(x) 162.536980 183.566490  225.801038 196.882049  269.020395  439.17737    50
        Uwe2(x)  83.582360  93.136277  115.608342  99.165997  115.376903  309.67290    50
  Uwe_Reduce(x)   1.732195   1.871940    4.215195   2.016815    4.842883   25.91542    50
Grothendieck(x) 620.814291 688.107779  767.749387 746.699435  850.442643  982.49094    50

PS: Do check out TiredSquirrel's answer, which seems like a base R version of Uwe's lead-lag answer but is somewhat faster (not benchmarked above).

Zymosis answered 20/6, 2019 at 19:5 Comment(7)
While this works for the provided example, I don't know that it gives the desired behavior for say, x <- c(3, 4, NA, NA, NA, NA, 3, 3).Overstep
@Overstep It gives [1] 3 4 NA NA 0 NA 3 3 which seems correct. Am I missing something?Zymosis
Not sure, the way I read OP's question I'd expect [1] 3 4 NA 0 0 NA 3 3 to be the desired result, but that's just how I interpreted it.Overstep
Nah, the first 0 in your output is > 2 away from nearest valid point.Zymosis
The question says more than 2 intervals from a valid data pointOverstep
Have you seen this new answer?Cnidus
@RuiBarradas I think it's just a base version of Uwe's answer (certainly faster though!). Probably should've been an edit or suggestion to his answer. I am not updating benchmarks anymore as it's too much work with so many answers now. I'll add a note to my answer instead. Thanks! :)Zymosis

Maybe there are simpler solutions but this one works.

na2zero <- function(x){
  # cumsum(abs(diff(is.na(x)))) assigns a new group id whenever the NA-status changes,
  # so ave() applies the function separately to each run of NAs / non-NAs;
  # within an NA run longer than 2, everything except the endpoints becomes 0
  ave(x, cumsum(abs(c(0, diff(is.na(x))))), FUN = function(y){
    if(anyNA(y)){
      if(length(y) > 2) y[-c(1, length(y))] <- 0
    }
    y
  })
}

na2zero(x)
#[1]  3  4 NA  0 NA  3  3

X <- list(x, c(x, x), c(3, 4, NA, NA, NA, NA, 3, 3))
lapply(X, na2zero)
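The grouping vector passed to ave() is the non-obvious part; for the example x it looks like this (values written out by hand, not pasted from a console):

x <- c(3, 4, NA, NA, NA, 3, 3)
cumsum(abs(c(0, diff(is.na(x)))))
#[1] 0 0 1 1 1 2 2
# a new group id starts whenever the NA-status changes, so each run of NAs
# (and each run of non-NAs) is handled as one group by ave()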
Cnidus answered 20/6, 2019 at 19:5 Comment(0)

Here's a data.table option

library(data.table)

na0_dt <- function(x){
  # zero out an element when it is NA (xna) and it is neither the first element
  # of its run (rowid(r) > 1) nor the last (rev(rowid(rev(r))) > 1)
  replace(x, rowid(r <- rleid(xna <- is.na(x))) > 1 & rev(rowid(rev(r))) > 1 & xna, 0)
}
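
A quick check against the question's example (the expected result is taken from the question itself):

na0_dt(c(3, 4, NA, NA, NA, 3, 3))
#[1]  3  4 NA  0 NA  3  3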
Hardy answered 20/6, 2019 at 19:59 Comment(4)
Interestingly, my base R solution seems faster for x of length 1e3. I have no idea how data.table works internally, but it seems like it has some initial overhead after which it just takes off! Therefore, the longer the vector, the more it can compensate.Zymosis
That's interesting, I added that case to the answer. I thought maybe the flipping-of-speeds would be because of the [.data.table overhead but it persists even after rewriting my function to avoid [.data.tableHardy
Your benchmark makes me feel I should downvote my answer. Tried it, but it's not possible, unfortunately.Visitant
Ha, that wasn't my intention. Even for a vector of length 1e5 the times are listed in milliseconds, so the timing isn't all that important for most cases probably.Hardy

For the sake of completeness, here are three other data.table approaches:

x <- c(3, 4, NA, NA, NA, 3, 3, NA, NA, 1, NA, 0, 0, rep(NA, 4L))

library(data.table)
data.table(x)[, x := replace(x, which(is.na(x))[-c(1L, .N)], 0), by =.(rleid(is.na(x)))]$x
[1]  3  4 NA  0 NA  3  3 NA NA  1 NA  0  0 NA  0  0 NA
x[data.table(x)[, .I[is.na(x)][-c(1L, .N)], by =.(rleid(is.na(x)))]$V1] <- 0
x
[1]  3  4 NA  0 NA  3  3 NA NA  1 NA  0  0 NA  0  0 NA

shift() & Reduce()

I was so focused on finding the right way to create groups that I started to think about the straightforward approach rather late. The rule is quite simple:

Replace with zero all NAs which are preceded and succeeded by another NA.

This can be accomplished by zoo::rollapply() as in G. Grothendieck's answer or by using lag() & lead() like in Shree's latest edit.

However, my own benchmark (not posted here to avoid duplication with Shree's benchmark) shows that combining data.table::shift() and Reduce() is the fastest method so far.

  isnax <- is.na(x) 
  x[Reduce(`&`, data.table::shift(isnax, -1:1))] <- 0
  x
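
Wrapped as a function, it might look like the sketch below; the name Uwe_Reduce follows Shree's benchmark above and is not part of the original snippet:

Uwe_Reduce <- function(x) {
  isnax <- is.na(x)
  # TRUE where the element and both of its neighbours are NA
  # (at the vector ends the shifted values are NA, so those positions are left untouched)
  x[Reduce(`&`, data.table::shift(isnax, -1:1))] <- 0
  x
}

Unlike the inline snippet, the function works on a local copy of x and returns it, so the caller's vector is not modified in place.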

It is also slightly faster than using lag() & lead() (please note that this differs from Shree's version, as is.na() is only called once):

  isnax <- is.na(x) 
  x[isnax & dplyr::lag(isnax) & dplyr::lead(isnax)] <- 0
  x
Afield answered 20/6, 2019 at 20:8 Comment(2)
Indeed, it is faster. Nice! I'll add it to my benchmarks.Zymosis
@Shree, thank you for benchmarking all the different solutions. BTW: I have switched from microbenchmark to bench for doing benchmarks because it allows varying problem sizes easily and creating charts. (I decided not to post the charts because you took on the burden of doing all the benchmarks already.)Afield

Based on the example, I assume what you mean is: if the value is NA and the adjacent values in both directions are NA (or in one direction if the value is first or last), then replace the value with 0. Use a centered rolling window of length 3 to return TRUE if the window is all NA, and then replace the TRUE positions with 0. This gives the following one-liner:

library(zoo)

replace(x, rollapply(c(TRUE, is.na(x), TRUE), 3, all), 0)
## [1]  3  4 NA  0 NA  3  3
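
For reuse (and to match the Grothendieck(x) call in Shree's benchmark), the one-liner can be wrapped up; the function name is taken from that benchmark, not from this answer, and zoo is assumed to be attached as above:

Grothendieck <- function(x) {
  # pad with TRUE so the centered width-3 window is defined at the first and last positions
  replace(x, rollapply(c(TRUE, is.na(x), TRUE), 3, all), 0)
}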
Francenefrances answered 20/6, 2019 at 20:48 Comment(1)
zoo has some pretty handy functions to work with missing valuesSolomon

Here's a "stupidly simple" solution:

is_na <- is.na(x)                            # Vector telling you whether each position in x is NA
na_before <- c(F, is_na[1:(length(x) - 1)])  # Whether each position has an NA before it
na_after <- c(is_na[2:length(x)], F)         # Whether each position has an NA after it
x[is_na & na_before & na_after] <- 0         # Set to 0 if all three are true

The creation of na_before and na_after are based on shifting one to the right or one to the left. To illustrate how this works, consider the letters below (I'm writing T and F as 1 and 0 to make them easier to distinguish):

              A  B  C  D  E
is_vowel      1  0  0  0  1
vowel_before  0  1  0  0  0
vowel_after   0  0  0  1  0

When you make vowel_before, you take the "10001" sequence of is_vowel and shift it one to the right (because each letter is now referring to the letter to its left). You drop the last 1 (you don't care that F has a vowel before it, because F isn't included) and you add a 0 at the beginning (the first letter has no letter before it, and therefore can't have a vowel before it). vowel_after is created with the same logic.
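
Spelled out in R, the vowel illustration looks like this (a small sketch of the same shifting; letters5 is just a throwaway name, and the variable names mirror the table above):

letters5 <- c("A", "B", "C", "D", "E")
is_vowel <- letters5 %in% c("A", "E", "I", "O", "U")                # 1 0 0 0 1
vowel_before <- c(FALSE, is_vowel[1:(length(is_vowel) - 1)])        # shift right: 0 1 0 0 0
vowel_after <- c(is_vowel[2:length(is_vowel)], FALSE)               # shift left:  0 0 0 1 0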

Edit. (Added by Rui Barradas)

This solution is, according to my benchmark, the fastest.
As a function:

TiredSquirrel <- function(x){
  is_na <- is.na(x)
  na_before <- c(FALSE, is_na[1:(length(x) - 1)])
  na_after <- c(is_na[2:length(x)], FALSE)
  x[is_na & na_before & na_after] <- 0
  x
}

And the benchmark.

x <- c(3, 4, NA, NA, NA, 3, 3)

r <- na2zero(x)
all.equal(r, TiredSquirrel(x))
#[1] TRUE

x <- sample(x, 1e3, TRUE)
r <- na2zero(x)
all.equal(r, TiredSquirrel(x))
#[1] TRUE

library(microbenchmark)

microbenchmark(
  Rui = na2zero(x),
  Uwe_Reduce = Uwe_Reduce(x),
  TiredSquirrel = TiredSquirrel(x)
)
#Unit: microseconds
#          expr      min        lq       mean    median        uq      max neval cld
#           Rui 3134.293 3198.8180 3365.70736 3263.7980 3391.7900 5593.111   100   b
#    Uwe_Reduce   99.895  104.3510  125.81417  113.9995  146.7335  244.280   100  a 
# TiredSquirrel   65.205   67.4365   72.41129   70.6430   75.8315  122.061   100  a 
Riordan answered 21/6, 2019 at 5:33 Comment(2)
Really nice! Welcome to SO, keep posting answers like this one, that's exactly what SO users need.Cnidus
Thank you!! I've edited your edit to show that credit for the benchmarking goes to you -- thanks for taking the time to do that!Riordan
