Conditionally replace missing values depending on surrounding non-missing values

Asked 6/4, 2018 at 15:28 Answered 8/4, 2018 at 11:31

I am trying to replace missing values (NA) in a vector. An NA between two equal numbers should be replaced by that number; an NA between two different values should stay NA. For example, given the vector "a", I want to obtain "b".

a = c(1, NA, NA, NA, 1, NA, NA, NA, 2, NA, NA, 2, 3, NA, NA, 3)
b = c(1, 1, 1, 1, 1, NA, NA, NA, 2, 2, 2, 2, 3, 3, 3, 3)

As you can see, the second run of NA, between the values 1 and 2, is not replaced.

Is there a way to vectorize the calculation?

Westward asked 6/4, 2018 at 15:28 Comment(2)

If there are NA at the beginning or end of the vector, do they stay NA? – Awl 6/4, 2018 at 15:41

Yes. they stay NA. – Westward 6/4, 2018 at 15:53

You may use convenience functions from the zoo package. Here we replace NA in the original vector wherever the interpolated values (created by na.approx) equal the 'last observation carried forward' values (created by na.locf):

library(zoo)
a_ap <- na.approx(a)
a_locf <- na.locf(a)
a[which(a_ap == a_locf)] <- a_ap[which(a_ap == a_locf)]
a
# [1]  1  1  1  1  1 NA NA NA  2  2  2  2  3  3  3  3

To account for leading and trailing NA, add na.rm = FALSE:

a <- c(NA, 1, NA, NA, NA, 1, NA, NA, NA, 2, NA, NA, 2, 3, NA, NA, 3, NA)

a_ap <- na.approx(a, na.rm = FALSE)
a_locf <- na.locf(a, na.rm = FALSE)
a[which(a_ap == a_locf)] <- a_ap[which(a_ap == a_locf)]
a
# [1] NA  1  1  1  1  1 NA NA NA  2  2  2  2  3  3  3  3 NA
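
For readers who would rather avoid the zoo dependency, a related idea can be sketched in base R: carry the last observation forward, carry the next observation backward, and fill an NA only where the two agree. This is a swapped-in variant, not the na.approx comparison above, and the helper names locf and fill_between_equal are hypothetical.

```r
# Carry the last non-NA value forward; leading NAs stay NA.
locf <- function(x) {
  idx <- cummax(seq_along(x) * !is.na(x))  # index of the most recent non-NA
  idx[idx == 0] <- NA                      # no earlier observation exists yet
  x[idx]
}

# Fill an NA only when its nearest non-NA neighbours on both sides are equal.
fill_between_equal <- function(x) {
  prev_val <- locf(x)            # nearest non-NA value to the left
  next_val <- rev(locf(rev(x)))  # nearest non-NA value to the right
  fill <- is.na(x) & !is.na(prev_val) & !is.na(next_val) & prev_val == next_val
  x[fill] <- prev_val[fill]
  x
}

a <- c(NA, 1, NA, NA, NA, 1, NA, NA, NA, 2, NA, NA, 2, 3, NA, NA, 3, NA)
fill_between_equal(a)
# [1] NA  1  1  1  1  1 NA NA NA  2  2  2  2  3  3  3  3 NA
```

Leading and trailing NAs stay NA here because one of the two carried vectors has no value at those positions.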

Merengue answered 8/4, 2018 at 10:0 Comment(0)

OP asked for a vectorized solution, so here is a possible vectorized base R solution (without for loops) that also handles leading and trailing NAs:

# Define a vector with leading/trailing NAs
a <- c(NA, NA, 1, NA, NA, NA, 1, NA, NA, NA, 2, NA, NA, 2, 3, NA, NA, 3, NA, NA)

# Save the boolean vector as we are going to reuse it a lot
na_vals <- is.na(a)

# Locate each NA relative to the non-NA positions
ind <- findInterval(which(na_vals), which(!na_vals))

# Find the consecutive non-NA values that are equal
ind2 <- which(!diff(a[!na_vals]))

# Fill only the NAs between equal consecutive values
a[na_vals] <- a[!na_vals][ind2[match(ind, ind2)]]
a
# [1] NA NA  1  1  1  1  1 NA NA NA  2  2  2  2  3  3  3  3 NA NA
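
To see why this works, it may help to print the intermediate vectors on a toy input. The sketch below repeats the same four steps on a short vector x, which is made up here purely for illustration.

```r
x <- c(NA, 1, NA, 1, NA, 2, NA)
na_vals <- is.na(x)

# For each NA, the index of the non-NA value to its left
# (0 means the NA precedes every observed value)
ind <- findInterval(which(na_vals), which(!na_vals))
ind
# [1] 0 1 2 3

# Indexes k such that the k-th and (k+1)-th observed values are equal
ind2 <- which(!diff(x[!na_vals]))
ind2
# [1] 1

# Only NAs whose left-neighbour index appears in ind2 get a value;
# every other NA is matched to NA and therefore stays NA
x[na_vals] <- x[!na_vals][ind2[match(ind, ind2)]]
x
# [1] NA  1  1  1 NA  2 NA
```

A trailing NA gets an index equal to the number of observed values, which can never appear in ind2, so it is left untouched just like a leading one.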

Some timing comparisons for a big vector:

# Create a big vector
set.seed(123)
a <- sample(c(NA, 1:5), 5e7, replace = TRUE)

############################################
##### Cainã Max Couto-Silva

fill_data <- function(vec) {

  for(l in unique(vec[!is.na(vec)])) {

    g <- which(vec %in% l)

    indexes <- list()

    for(i in 1:(length(g) - 1)) {
      indexes[[i]] <- (g[i]+1):(g[i+1]-1)
    }

    for(i in 1:(length(g) - 1)) { 
      if(all(is.na(vec[indexes[[i]]]))) {
        vec[indexes[[i]]] <- l
      }
    }
  }

  return(vec)
}

system.time(res <- fill_data(a))
#   user  system elapsed 
#  81.73    4.41   86.48 

############################################
##### Henrik

library(zoo)  # provides na.approx() and na.locf()

system.time({
  a_ap <- na.approx(a, na.rm = FALSE)
  a_locf <- na.locf(a, na.rm = FALSE)
  a[which(a_ap == a_locf)] <- a_ap[which(a_ap == a_locf)]
})
#  user  system elapsed 
# 12.55    3.39   15.98 

# Validate
identical(res, as.integer(a))
# [1] TRUE

############################################
##### David

## Recreate a, as it has been overwritten
set.seed(123)
a <- sample(c(NA, 1:5), 5e7, replace = TRUE)

system.time({
  # Save the boolean vector as we are going to reuse it a lot
  na_vals <- is.na(a)

  # Locate each NA relative to the non-NA positions
  ind <- findInterval(which(na_vals), which(!na_vals))

  # Find the consecutive non-NA values that are equal
  ind2 <- which(!diff(a[!na_vals]))

  # Fill only the NAs between equal consecutive values
  a[na_vals] <- a[!na_vals][ind2[match(ind, ind2)]]
})
# user  system elapsed 
# 3.39    0.71    4.13 

# Validate
identical(res, a)
# [1] TRUE
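
Before timing on 5e7 elements, it can be reassuring to check the vectorized steps against a plain loop on a small random vector. The following is a minimal sketch under that assumption; the function names fill_findinterval and fill_reference are hypothetical wrappers and are not part of any of the answers.

```r
# The findInterval-based steps wrapped in a function
fill_findinterval <- function(a) {
  na_vals <- is.na(a)
  ind <- findInterval(which(na_vals), which(!na_vals))
  ind2 <- which(!diff(a[!na_vals]))
  a[na_vals] <- a[!na_vals][ind2[match(ind, ind2)]]
  a
}

# Straightforward loop over consecutive pairs of observed values
fill_reference <- function(a) {
  obs <- which(!is.na(a))
  for (k in seq_len(max(length(obs) - 1, 0))) {
    lo <- obs[k]
    hi <- obs[k + 1]
    # Fill the gap only if there is one and both ends hold the same value
    if (hi - lo > 1 && a[lo] == a[hi]) a[(lo + 1):(hi - 1)] <- a[lo]
  }
  a
}

set.seed(42)
small <- sample(c(NA, 1:5), 1e4, replace = TRUE)
identical(fill_findinterval(small), fill_reference(small))
# [1] TRUE
```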

Upchurch answered 8/4, 2018 at 11:31 Comment(0)

You can write a function like this:

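fill_data <- function(vec) {

  # Loop over each distinct non-NA value
  for(l in unique(vec[!is.na(vec)])) {

    # Positions where this value occurs
    g <- which(vec %in% l)

    indexes <- list()

    # seq_len() yields an empty loop when the value occurs only once,
    # whereas 1:(length(g) - 1) would iterate over c(1, 0) and fail
    for(i in seq_len(length(g) - 1)) {
      indexes[[i]] <- (g[i]+1):(g[i+1]-1)
    }

    # Fill a gap only if every position in it is NA
    for(i in seq_len(length(g) - 1)) {
      if(all(is.na(vec[indexes[[i]]]))) {
        vec[indexes[[i]]] <- l
      }
    }
  }

  return(vec)
}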

Running the function:

a = c(1, NA, NA, NA, 1, NA, NA, NA, 2, NA, NA, 2, 3, NA, NA, 3)

fill_data(a)
[1]  1  1  1  1  1 NA NA NA  2  2  2  2  3  3  3  3

It also works when the same value recurs in different parts of the vector:

ab = c(1, NA, NA, NA, 1, NA, NA, NA, 1, NA, 2, NA, NA, NA, 2, NA , 1, NA, 1, 3, NA, NA, 3)

fill_data(ab)
[1]  1  1  1  1  1  1  1  1  1 NA  2  2  2  2  2 NA  1  1  1  3  3  3  3

Explanation:

First, the function finds the unique non-NA values.

Then, for each unique value, it takes the indexes where that value occurs and collects the positions lying between each pair of consecutive occurrences.

Finally, it tests whether the values at those positions are all NA and, if they are, replaces them with that value.
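
To make these steps concrete, here is a hand-traced sketch of a single pass for one value. The toy vector is made up for illustration, and the names l, g and gap simply mirror the variables used inside the function.

```r
vec <- c(1, NA, NA, 1, NA, 2)

l <- 1                          # the value currently being processed
g <- which(vec %in% l)          # positions where it occurs
g
# [1] 1 4

gap <- (g[1] + 1):(g[2] - 1)    # positions strictly between two occurrences
gap
# [1] 2 3

all(is.na(vec[gap]))            # the gap holds only NAs, so it can be filled
# [1] TRUE

vec[gap] <- l
vec
# [1]  1  1  1  1 NA  2
```

The NA at position 5 is left alone because it sits between a 1 and a 2.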

Curb answered 6/4, 2018 at 17:16 Comment(4)

Technically doesn't fit OP's requirements, since output is a character vector no matter what. If you add vec_cls <- class(vec) at the very beginning of the function and class(vec) <- vec_cls just before the return that would fix it. Or even better would be to use unique(vec[!is.na(vec)]) instead of levels(factor(vec)) and then g <- which(vec %in% l) instead of grep. – Awl 6/4, 2018 at 17:47

Yeah, I noticed the character vector returned, but I think it would not be a problem at all since it would be easily fixed afterwards by res <- as.cls(res). Surely is more practical to fix that inside the function. Feel free for editing the answer (it's my first answer here). Kind regards. – Springs 6/4, 2018 at 18:9

I like the edit that avoids the whole problem by not going to levels and grep. Would be more efficient probably too. – Awl 6/4, 2018 at 18:10

This is very un-vectorized, while OP asked explicitly for a vectorized solution – Upchurch 8/4, 2018 at 10:32

