Replacing NAs in R with nearest value
Asked Answered
S

6

32

I'm looking for something similar to na.locf() in the zoo package, but instead of always using the previous non-NA value I'd like to use the nearest non-NA value. Some example data:

dat <- c(1, 3, NA, NA, 5, 7)

Replacing NA with na.locf (3 is carried forward):

library(zoo)
na.locf(dat)
# 1 3 3 3 5 7

and na.locf with fromLast set to TRUE (5 is carried backwards):

na.locf(dat, fromLast = TRUE)
# 1 3 5 5 5 7

But I wish the nearest non-NA value to be used. In my example this means that the 3 should be carried forward to the first NA, and the 5 should be carried backwards to the second NA:

1 3 3 5 5 7

I have a solution coded up, but wanted to make sure that I wasn't reinventing the wheel. Is there something already floating around?

FYI, my current code is as follows. Perhaps if nothing else, someone can suggest how to make it more efficient. I feel like I'm missing an obvious way to improve this:

  na.pos <- which(is.na(dat))
  if (length(na.pos) == length(dat)) {
    return(dat)
  }
  non.na.pos <- setdiff(seq_along(dat), na.pos)
  nearest.non.na.pos <- sapply(na.pos, function(x) {
    return(which.min(abs(non.na.pos - x)))
  })
  dat[na.pos] <- dat[non.na.pos[nearest.non.na.pos]]

To answer smci's questions below:

  1. No, any entry can be NA
  2. If all are NA, leave them as is
  3. No. My current solution defaults to the lefthand nearest value, but it doesn't matter
  4. These rows are a few hundred thousand elements typically, so in theory the upper bound would be a few hundred thousand. In reality it'd be no more than a few here & there, typically a single one.

Update So it turns out that we're going in a different direction altogether but this was still an interesting discussion. Thanks all!

Seigniorage answered 9/4, 2012 at 17:53 Comment(6)
did you look at the optional parameters to na.locf? fromLast looks like it may do what you want.Smallage
It doesn't, as it just takes the previous value in the opposite direction. It won't find the nearest non-NA valueSeigniorage
'Mind posting your solution? I'd be curious to see what you've got.Meade
Just did, realized that if nothing else perhaps this could turn into how to make what I'm doing betterSeigniorage
We could iterate over rle(which(is.na(dat))). Not saying that's the most efficient but it's an improvement. See also "How can I count runs in R?" which needs a tweak rle.na() to handle NAs.Hermitage
You may want to look into R and ngrams as your question is some what related to textual ngrams. Just a thought, this is definitely an interesting question.Rhyolite
U
28

Here is a very fast one. It uses findInterval to find what two positions should be considered for each NA in your original data:

f1 <- function(dat) {
  N <- length(dat)
  na.pos <- which(is.na(dat))
  if (length(na.pos) %in% c(0, N)) {
    return(dat)
  }
  non.na.pos <- which(!is.na(dat))
  intervals  <- findInterval(na.pos, non.na.pos,
                             all.inside = TRUE)
  left.pos   <- non.na.pos[pmax(1, intervals)]
  right.pos  <- non.na.pos[pmin(N, intervals+1)]
  left.dist  <- na.pos - left.pos
  right.dist <- right.pos - na.pos

  dat[na.pos] <- ifelse(left.dist <= right.dist,
                        dat[left.pos], dat[right.pos])
  return(dat)
}

And here I test it:

# sample data, suggested by @JeffAllen
dat <- as.integer(runif(50000, min=0, max=10))
dat[dat==0] <- NA

# computation times
system.time(r0 <- f0(dat))    # your function
# user  system elapsed 
# 5.52    0.00    5.52
system.time(r1 <- f1(dat))    # this function
# user  system elapsed 
# 0.01    0.00    0.03
identical(r0, r1)
# [1] TRUE
Unoccupied answered 10/4, 2012 at 1:30 Comment(3)
Wow. I think we have a winner. Very nice.Intendant
I believe your line 4 should be if (length(na.pos) == 0)Parasynapsis
Thank you @Morten. Actually, both extreme cases needed early exit. It is fixed.Unoccupied
H
6

Code below. The initial question was not totally well-defined, I had asked for these clarifications:

  1. Is it guaranteed that at least the first and/or last entries are non-NA? [No]
  2. What to do if all entries in a row are NA? [Leave as-is]
  3. Do you care how ties are split i.e. how to treat the middle NA in 1 3 NA NA NA 5 7? [Don't-care/ left]
  4. Do you have an upper-bound (S) on the longest contiguous span of NAs in a row? (I'm thinking a recursive solution if S is small. Or a dataframe solution with ifelse if S is large and number of rows and cols is large.) [worst-case S could be pathologically large, hence recursion should not be used]

geoffjentry, re your solution your bottlenecks will be the serial calculation of nearest.non.na.pos and the serial assignment dat[na.pos] <- dat[non.na.pos[nearest.non.na.pos]] For a large gap of length G all we really need to compute is that the first (G/2, round up) items fill-from-left, the rest from right. (I could post an answer using ifelse but it would look similar.) Are your criteria runtime, big-O efficiency, temp memory usage, or code legibility?

Coupla possible tweaks:

  • only need to compute N <- length(dat) once
  • common-case speed enhance: if (length(na.pos) == 0) skip row, since it has no NAs
  • if (length(na.pos) == length(dat)-1) the (rare) case where there is only one non-NA entry hence we fill entire row with it

Outline solution:

Sadly na.locf does not work on an entire dataframe, you must use sapply, row-wise:

na.fill_from_nn <- function(x) {
  row.na <- is.na(x)
  fillFromLeft <- na.locf(x, na.rm=FALSE) 
  fillFromRight <- na.locf(x, fromLast=TRUE, na.rm=FALSE)

  disagree <- rle(fillFromLeft!=fillFromRight)
  for (loc in (disagree)) { ...  resolve conflicts, row-wise }
}

sapply(dat, na.fill_from_nn)

Alternatively, since as you say contiguous NAs are rare, do a fast-and-dumb ifelse to fill isolated NAs from left. This will operate data-frame wise => makes the common-case fast. Then handle all the other cases with a row-wise for-loop. (This will affect the tiebreak on middle elements in a long span of NAs, but you say you don't care.)

Hermitage answered 9/4, 2012 at 19:37 Comment(4)
@joran, the most efficient answer will vary hugely if number of cols is high and S is high too. It will also vary depending on what proportion of rows have no NAs.Hermitage
I was hoping to find something that ran faster. My initial assumption was that I was missing something similar to na.locf() that was much faster than what I had.Seigniorage
@geoffjentry: Add the common-case speed enhance: if (length(na.pos) == 0) I suggested (how common is that case?)Hermitage
I actually have that in my real code, I took it out as I figured someone would point out that it didn't do much. The sapply() will simply iterate over nothing if na.pos is of length 0.Seigniorage
B
5

I like all the rigorous solutions. Though not directly what was asked, I found this post looking for a solution to filling NA values with an interpolation. After reviewing this post I discovered na.fill on a zoo object(vector, factor, or matrix):

z <- c(1,2,3,4,5,6,NA,NA,NA,2,3,4,5,6,NA,NA,4,6,7,NA)
z1 <- zoo::na.fill(z, "extend")

Note the smooth transition across the NA values

round(z1, 0)
#>  [1] 1 2 3 4 5 6 5 4 3 2 3 4 5 6 5 5 4 6 7 7

Perhaps this could help

Bursar answered 13/4, 2017 at 9:11 Comment(1)
I've slightly edited this answer, because a transformation into a zoo object is not necessary for this solutionTergum
I
4

I can't think of an obvious simple solution, but, having looked at the suggestions (particularly smci's suggestion of using rle) I came up with a complicated function that appears to be more efficient.

This is the code, I'll explain below:

# Your function
your.func = function(dat) {
  na.pos <- which(is.na(dat))
  if (length(na.pos) == length(dat)) {
    return(dat)
  }
  non.na.pos <- setdiff(seq_along(dat), na.pos)
  nearest.non.na.pos <- sapply(na.pos, function(x) which.min(abs(non.na.pos - x)))
  dat[na.pos] <- dat[non.na.pos[nearest.non.na.pos]]
  dat
}

# My function
my.func = function(dat) {
    nas=is.na(dat)
    if (!any(!nas)) return (dat)
    t=rle(nas)
    f=sapply(t$lengths[t$values],seq)
    a=unlist(f)
    b=unlist(lapply(f,rev))
    x=which(nas)
    l=length(dat)
    dat[nas]=ifelse(a>b,dat[ ifelse((x+b)>l,x-a,x+b) ],dat[ifelse((x-a)<1,x+b,x-a)])
    dat
}


# Test
n = 100000
test.vec = 1:n
set.seed(1)
test.vec[sample(test.vec,n/4)]=NA

system.time(t1<-my.func(test.vec))
system.time(t2<-your.func(test.vec)) # 10 times speed improvement on my machine

# Verify
any(t1!=t2)

My function relies on rle. I am reading the comments above but it looks to me like rle works just fine for NA. It is easiest to explain with a small example.

If I start with a vector:

dat=c(1,2,3,4,NA,NA,NA,8,NA,10,11,12,NA,NA,NA,NA,NA,18)

I then get the positions of all the NAs:

x=c(5,6,7,8,13,14,15,16,17)

Then, for every "run" of NAs I create a sequence from 1 to the length of the run:

a=c(1,2,3,1,1,2,3,4,5)

Then I do it again, but I reverse the sequence:

b=c(3,2,1,1,5,4,3,2,1)

Now, I can just compare vectors a and b: If a<=b then look back and grab the value at x-a. If a>b then look ahead and grab the value at x+b. The rest is just handling the corner cases when you have all NAs or NA runs at the end or the start of the vector.

There is probably a better, simpler, solution, but I hope this gets you started.

Intendant answered 9/4, 2012 at 21:52 Comment(2)
@smci: No problem. Is it possible to format usernames so that they are linked?Intendant
Just like any other hyperlink: open-square-bracket name close-square-bracket (URL) Or click 'help' on right.Hermitage
M
2

Here's my stab at it. I never like to see a for loop in R, but in the case of a sparsely-NA vector, it looks like it will actually be more efficient (performance metrics below). The gist of the code is below.

  #get the index of all NA values
  nas <- which(is.na(dat))

  #get the Boolean map of which are NAs, used later to determine which values can be used as a replacement, and which are just filled-in NA values
  namask <- is.na(dat)

  #calculate the maximum size of a run of NAs
  length <- getLengthNAs(dat);

  #the furthest away an NA value could be is half of the length of the maximum NA run
  windowSize <- ceiling(length/2)

  #loop through all NAs
  for (thisIndex in nas){
    #extract the neighborhood of this NA
    neighborhood <- dat[(thisIndex-windowSize):(thisIndex+windowSize)]
    #any already-filled-in values which were NA can be replaced with NAs
    neighborhood[namask[(thisIndex-windowSize):(thisIndex+windowSize)]] <- NA

    #the center of this neighborhood
    center <- windowSize + 1

    #compute the difference within this neighborhood to find the nearest non-NA value
    delta <- center - which(!is.na(neighborhood))

    #find the closest replacement
    replacement <- delta[abs(delta) == min(abs(delta))]
    #in case length > 1, just pick the first
    replacement <- replacement[1]

    #replace with the nearest non-NA value.
    dat[thisIndex] <- dat[(thisIndex - (replacement))]
  }

I liked the code you proposed, but I noticed that we were calculating the delta between every NA value and every other non-NA index in the matrix. I think this was the biggest performance hog. Instead, I just extract the minimum-sized neighborhood or window around each NA and find the nearest non-NA value within that window.

So the performance scales linearly on the number of NAs and the window size -- where the window size is (the ceiling of) half the length of the maximum run of NAs. To calculate the length of the maximum run of NAs, you can use the following function:

getLengthNAs <- function(dat){
  nas <- which(is.na(dat))
  spacing <- diff(nas)
  length <- 1;
  while (any(spacing == 1)){        
    length <- length + 1;
    spacing <- diff(which(spacing == 1))
  }
    length
}

Performance Comparison

#create a test vector with 10% NAs and length 50,000.
dat <- as.integer(runif(50000, min=0, max=10))
dat[dat==0] <- NA

#the a() function is the code posted in the question
a <- function(dat){
  na.pos <- which(is.na(dat))
    if (length(na.pos) == length(dat)) {
        return(dat)
    }
    non.na.pos <- setdiff(seq_along(dat), na.pos)
    nearest.non.na.pos <- sapply(na.pos, function(x) {
        return(which.min(abs(non.na.pos - x)))
    })
    dat[na.pos] <- dat[non.na.pos[nearest.non.na.pos]]
    dat
}

#my code
b <- function(dat){
    #the same code posted above, but with some additional helper code to sanitize the input
    if(is.null(dat)){
      return(NULL);
    }

    if (all(is.na(dat))){
      stop("Can't impute NAs if there are no non-NA values.")
    }

    if (!any(is.na(dat))){
      return(dat);
    }

    #starts with an NA (or multiple), handle these
    if (is.na(dat[1])){
      firstNonNA <- which(!is.na(dat))[1]
      dat[1:(firstNonNA-1)] <- dat[firstNonNA]
    }

    #ends with an NA (or multiple), handle these
    if (is.na(dat[length(dat)])){
      lastNonNA <- which(!is.na(dat))
      lastNonNA <- lastNonNA[length(lastNonNA)]
      dat[(lastNonNA+1):length(dat)] <- dat[lastNonNA]
    }

    #get the index of all NA values
    nas <- which(is.na(dat))

    #get the Boolean map of which are NAs, used later to determine which values can be used as a replacement, and which are just filled-in NA values
    namask <- is.na(dat)

    #calculate the maximum size of a run of NAs
    length <- getLengthNAs(dat);

    #the furthest away an NA value could be is half of the length of the maximum NA run
    #if there's a run at the beginning or end, then the nearest non-NA value could possibly be `length` away, so we need to keep the window large for that case.
    windowSize <- ceiling(length/2)

    #loop through all NAs
    for (thisIndex in nas){
      #extract the neighborhood of this NA
      neighborhood <- dat[(thisIndex-windowSize):(thisIndex+windowSize)]
      #any already-filled-in values which were NA can be replaced with NAs
      neighborhood[namask[(thisIndex-windowSize):(thisIndex+windowSize)]] <- NA

      #the center of this neighborhood
      center <- windowSize + 1

      #compute the difference within this neighborhood to find the nearest non-NA value
      delta <- center - which(!is.na(neighborhood))

      #find the closest replacement
      replacement <- delta[abs(delta) == min(abs(delta))]
      #in case length > 1, just pick the first
      replacement <- replacement[1]

      #replace with the nearest non-NA value.
      dat[thisIndex] <- dat[(thisIndex - (replacement))]
    }
    dat
}

#nograpes' answer on this question
c <- function(dat){
  nas=is.na(dat)
  if (!any(!nas)) return (dat)
  t=rle(nas)
  f=sapply(t$lengths[t$values],seq)
  a=unlist(f)
  b=unlist(lapply(f,rev))
  x=which(nas)
  l=length(dat)
  dat[nas]=ifelse(a>b,dat[ ifelse((x+b)>l,x-a,x+b) ],dat[ifelse((x-a)<1,x+b,x-a)])
  dat
}

#run 10 times each to get average performance.
sum <- 0; for (i in 1:10){ sum <- sum + system.time(a(dat))["elapsed"];}; cat ("A: ", sum/10)
A:  5.059
sum <- 0; for (i in 1:10){ sum <- sum + system.time(b(dat))["elapsed"];}; cat ("B: ", sum/10)
B:  0.126
sum <- 0; for (i in 1:10){ sum <- sum + system.time(c(dat))["elapsed"];}; cat ("C: ", sum/10)
C:  0.287

So it looks like this code (at least under these conditions), offers about a 40X speedup from the original code posted in the question, and a 2.2X speedup over @nograpes' answer below (though I imagine an rle solution would certainly be faster in some situations -- including a more NA-rich vector).

Meade answered 10/4, 2012 at 0:10 Comment(1)
After all that, I figure I might as well package it up. The complete code, along with RUnit tests to verify correct behavior is available on GitHub: github.com/trestletech/R-UtilsMeade
R
1

Speed is about 3-4x slower than that of the chosen answer. Mine is pretty simple though. It's a rare while loop too.

f2 <- function(x){

  # check if all are NA to skip loop
  if(!all(is.na(x))){

    # replace NA's until they are gone
    while(anyNA(x)){

      # replace from the left
      x[is.na(x)] <- c(NA,x[1:(length(x)-1)])[is.na(x)]

      # replace from the right
      x[is.na(x)] <- c(x[-1],NA)[is.na(x)]
    }
  }

  # return original or fixed x
  x
}
Retrocede answered 5/8, 2015 at 5:9 Comment(0)

© 2022 - 2024 — McMap. All rights reserved.