I have multiple vectors with NAs and my intention to fill NA which are more than 2 intervals from a valid data point with 0. for example:
x <- c(3, 4, NA, NA, NA, 3, 3)
Expected output is,
3, 4, NA, 0, NA, 3, 3
I have multiple vectors with NAs and my intention to fill NA which are more than 2 intervals from a valid data point with 0. for example:
x <- c(3, 4, NA, NA, NA, 3, 3)
Expected output is,
3, 4, NA, 0, NA, 3, 3
Update -
Here's probably one of the simplest and fastest solutions (Thanks to answer from G. Grothendieck). Simply knowing whether the value is NA
on either side of any NA
is sufficient information. Therefore, using lead
and lag
from dplyr
package -
na2zero <- function(x) {
x[is.na(lag(x, 1, 0)) & is.na(lead(x, 1, 0)) & is.na(x)] <- 0
x
}
na2zero(x = c(3, 4, NA, NA, NA, 3, 3))
[1] 3 4 NA 0 NA 3 3
na2zero(x = c(3, 4, NA, NA, NA, NA, NA, 3, 3))
[1] 3 4 NA 0 0 0 NA 3 3
na2zero(x = c(3, 4, NA, NA, NA, 3, 3, NA, NA, 1, NA, 0, 0, rep(NA, 4L)))
[1] 3 4 NA 0 NA 3 3 NA NA 1 NA 0 0 NA 0 0 NA
Previous Answer (also fast) -
Here's one way using rle
and replace
from base R. This method turns every NA
, that is not an endpoint in the running length, into a 0
-
na2zero <- function(x) {
run_lengths <- rle(is.na(x))$lengths
replace(x,
sequence(run_lengths) != 1 &
sequence(run_lengths) != rep(run_lengths, run_lengths) &
is.na(x),
0)
}
na2zero(x = c(3, 4, NA, NA, NA, 3, 3))
[1] 3 4 NA 0 NA 3 3
na2zero(x = c(3, 4, NA, NA, NA, NA, NA, 3, 3))
[1] 3 4 NA 0 0 0 NA 3 3
Updated Benchmarks -
set.seed(2)
x <- c(3, 4, NA, NA, NA, 3, 3)
x <- sample(x, 1e5, T)
microbenchmark(
Rui(x),
Shree_old(x), Shree_new(x),
markus(x),
IceCreamT(x),
Uwe1(x), Uwe2(x), Uwe_Reduce(x),
Grothendieck(x),
times = 50
)
all.equal(Shree_dplyr(x), Rui(x)) # [1] TRUE
all.equal(Shree_dplyr(x), Shree_rle(x)) # [1] TRUE
all.equal(Shree_dplyr(x), markus(x)) # [1] TRUE
all.equal(Shree_dplyr(x), Uwe1(x)) # [1] TRUE
all.equal(Shree_dplyr(x), Uwe2(x)) # [1] TRUE
all.equal(Shree_dplyr(x), Uwe_Reduce(x)) # [1] TRUE
all.equal(Shree_dplyr(x), Grothendieck(x)) # [1] TRUE
Unit: milliseconds
expr min lq mean median uq max neval
Rui(x) 286.026540 307.586604 342.620266 318.404731 363.844258 518.03330 50
Shree_rle(x) 51.556489 62.038875 85.348031 65.012384 81.882141 327.57514 50
Shree_dplyr(x) 3.996918 4.258248 17.210709 6.298946 10.335142 207.14732 50
markus(x) 853.513854 885.419719 1001.450726 919.930389 1018.353847 1642.25435 50
IceCreamT(x) 12.162079 13.773873 22.555446 15.021700 21.271498 199.08993 50
Uwe1(x) 162.536980 183.566490 225.801038 196.882049 269.020395 439.17737 50
Uwe2(x) 83.582360 93.136277 115.608342 99.165997 115.376903 309.67290 50
Uwe_Reduce(x) 1.732195 1.871940 4.215195 2.016815 4.842883 25.91542 50
Grothendieck(x) 620.814291 688.107779 767.749387 746.699435 850.442643 982.49094 50
PS: Do check out TiredSquirell's answer which seems like a base version of Uwe's lead-lag answer but is somewhat faster (not benchmarked above).
x <- c(3, 4, NA, NA, NA, NA, 3, 3)
. –
Overstep [1] 3 4 NA NA 0 NA 3 3
which seems correct. Am I missing something? –
Zymosis [1] 3 4 NA 0 0 NA 3 3
to be the desired result, but that's just how I interpreted it. –
Overstep more than 2 intervals from a valid data point
–
Overstep Maybe there are simpler solutions but this one works.
na2zero <- function(x){
ave(x, cumsum(abs(c(0, diff(is.na(x))))), FUN = function(y){
if(anyNA(y)){
if(length(y) > 2) y[-c(1, length(y))] <- 0
}
y
})
}
na2zero(x)
#[1] 3 4 NA 0 NA 3 3
X <- list(x, c(x, x), c(3, 4, NA, NA, NA, NA, 3, 3))
lapply(X, na2zero)
Here's a data.table option
library(data.table)
na0_dt <- function(x){
replace(x, rowid(r <- rleid(xna <- is.na(x))) > 1 & rev(rowid(rev(r))) > 1 & xna, 0)
}
x
of length 1e3
. I have no idea how data.table
works but seems like it has some initial overhead after which it just takes off! Therefore longer the vector more it can compensate. –
Zymosis [.data.table
overhead but it persists even after rewriting my function to avoid [.data.table
–
Hardy For the sake of completeness, here are three other data.table approaches:
x <- c(3, 4, NA, NA, NA, 3, 3, NA, NA, 1, NA, 0, 0, rep(NA, 4L))
library(data.table)
data.table(x)[, x := replace(x, which(is.na(x))[-c(1L, .N)], 0), by =.(rleid(is.na(x)))]$x
[1] 3 4 NA 0 NA 3 3 NA NA 1 NA 0 0 NA 0 0 NA
x[data.table(x)[, .I[is.na(x)][-c(1L, .N)], by =.(rleid(is.na(x)))]$V1] <- 0
x
[1] 3 4 NA 0 NA 3 3 NA NA 1 NA 0 0 NA 0 0 NA
shift()
& Reduce()
I was so focused on finding the right way to create groups that I started to think about the straightforward approach rather late. The rule is quite simple:
Replace all NAs by zero which are preceeded and succeeded by another NA.
This can be accomplished by zoo::rollapply()
as in G. Grothendieck's answer or by using lag()
& lead()
like in Shree's latest edit.
However, my own benchmark (not posted here to avoid duplication with Shree' benchmark) shows that data.table::shift()
and Reduce()
is the fastest method so far.
isnax <- is.na(x)
x[Reduce(`&`, data.table::shift(isnax, -1:1))] <- 0
x
It is also slightly faster than using lag()
& lead()
(please, note that this differs from Shree's version as is.na()
is only called once):
isnax <- is.na(x)
x[isnax & dplyr::lag(isnax) & dplyr::lead(isnax)] <- 0
x
microbenchmark
to bench
for doing benchmarks because it allows to vary problem sizes easily and to create charts. (I decided not to post the charts because you took the burden to do all the benchmarks, already.) –
Afield Based on the example, I assume what you mean is that if the value is NA and adjacent values in both directions are NA (or in one direction if the value is first or last) then replace the value with 0. Using a centered rolling window of length 3 return TRUE if it is all NA and then replace the TRUE positions with 0. This gives the following one-liner
library(zoo)
replace(x, rollapply(c(TRUE, is.na(x), TRUE), 3, all), 0)
## [1] 3 4 NA 0 NA 3 3
zoo
has some pretty handy functions to work with missing values –
Solomon Here's a "stupidly simple" solution:
is_na <- is.na(x) # Vector telling you whether each position in x is NA
na_before <- c(F,is_na[1:(length(x)-1)]) # Whether each position has an NA before it
na_after <- c(is_na[2:length(x),F) # Whether each position has an NA after it
x[is_na & na_before & na_after] <- 0 # Set to 0 if all three are true
The creation of na_before and na_after are based on shifting one to the right or one to the left. To illustrate how this works, consider the letters below (I'm writing T and F as 1 and 0 to make them easier to distinguish):
A B C D E is_vowel 1 0 0 0 1 vowel_before 0 1 0 0 0 vowel_after 0 0 0 1 0
When you make vowel_before, you take the "10001" sequence of is_vowel and shift it one to the right (because each letter is now referring to the letter to its left). You drop the last 1 (you don't care that F has a vowel before it, because F isn't included) and you add a 0 at the beginning (the first letter has no letter before it, and therefore can't have a vowel before it). vowel_after is created with the same logic.
Edit. (Added by Rui Barradas)
This solution is, according to my benchmark, the fastest.
As a function:
TiredSquirrel <- function(x){
is_na <- is.na(x)
na_before <- c(FALSE, is_na[1:(length(x) - 1)])
na_after <- c(is_na[2:length(x)], FALSE)
x[is_na & na_before & na_after] <- 0
x
}
And the benchmark.
x <- c(3, 4, NA, NA, NA, 3, 3)
r <- na2zero(x)
all.equal(r, TiredSquirrel(x))
#[1] TRUE
x <- sample(x, 1e3, TRUE)
r <- na2zero(x)
all.equal(r, TiredSquirrel(x))
#[1] TRUE
microbenchmark(
Rui = na2zero(x),
Uwe_Reduce = Uwe_Reduce(x),
TiredSquirrel = TiredSquirrel(x)
)
#Unit: microseconds
# expr min lq mean median uq max neval cld
# Rui 3134.293 3198.8180 3365.70736 3263.7980 3391.7900 5593.111 100 b
# Uwe_Reduce 99.895 104.3510 125.81417 113.9995 146.7335 244.280 100 a
# TiredSquirrel 65.205 67.4365 72.41129 70.6430 75.8315 122.061 100 a
© 2022 - 2024 — McMap. All rights reserved.