Replacing NAs with latest non-NA value
F

22

212

In a data.frame (or data.table), I would like to "fill forward" NAs with the closest previous non-NA value. A simple example, using vectors (instead of a data.frame) is the following:

> y <- c(NA, 2, 2, NA, NA, 3, NA, 4, NA, NA)

I would like a function fill.NAs() that allows me to construct yy such that:

> yy
[1] NA  2  2  2  2  3  3  4  4  4

I need to repeat this operation for many (total ~1 TB) small data.frames (~30-50 MB each), where a row counts as NA if all its entries are NA. What is a good way to approach the problem?

The ugly solution I cooked up uses this function:

last <- function (x){
    x[length(x)]
}    

fill.NAs <- function(isNA){
if (isNA[1] == 1) {
    isNA[1:max({which(isNA==0)[1]-1},1)] <- 0 # first is NAs 
                                              # can't be forward filled
}
isNA.neg <- isNA.pos <- isNA.diff <- diff(isNA)
isNA.pos[isNA.diff < 0] <- 0
isNA.neg[isNA.diff > 0] <- 0
which.isNA.neg <- which(as.logical(isNA.neg))
if (length(which.isNA.neg)==0) return(NULL) # generates warnings later, but works
which.isNA.pos <- which(as.logical(isNA.pos))
which.isNA <- which(as.logical(isNA))
if (length(which.isNA.neg)==length(which.isNA.pos)){
    replacement <- rep(which.isNA.pos[2:length(which.isNA.neg)], 
                                which.isNA.neg[2:max(length(which.isNA.neg)-1,2)] - 
                                which.isNA.pos[1:max(length(which.isNA.neg)-1,1)])      
    replacement <- c(replacement, rep(last(which.isNA.pos), last(which.isNA) - last(which.isNA.pos)))
} else {
    replacement <- rep(which.isNA.pos[1:length(which.isNA.neg)], which.isNA.neg - which.isNA.pos[1:length(which.isNA.neg)])     
    replacement <- c(replacement, rep(last(which.isNA.pos), last(which.isNA) - last(which.isNA.pos)))
}
replacement
}

The function fill.NAs is used as follows:

y <- c(NA, 2, 2, NA, NA, 3, NA, 4, NA, NA)
isNA <- as.numeric(is.na(y))
replacement <- fill.NAs(isNA)
if (length(replacement)){
which.isNA <- which(as.logical(isNA))
to.replace <- which.isNA[which(isNA==0)[1]:length(which.isNA)]
y[to.replace] <- y[replacement]
} 

Output

> y
[1] NA  2  2  2  2  3  3  4  4  4

... which seems to work. But, man, is it ugly! Any suggestions?

Fontainebleau answered 12/10, 2011 at 5:27 Comment(1)
From other questions since this one, I think you've now found roll=TRUE in data.table.Chair
E
207

You probably want to use the na.locf() function from the zoo package to carry the last observation forward to replace your NA values.

Here is the beginning of its usage example from the help page:

library(zoo)

az <- zoo(1:6)

bz <- zoo(c(2,NA,1,4,5,2))

na.locf(bz)
1 2 3 4 5 6 
2 2 1 4 5 2 

na.locf(bz, fromLast = TRUE)
1 2 3 4 5 6 
2 1 1 4 5 2 

cz <- zoo(c(NA,9,3,2,3,2))

na.locf(cz)
2 3 4 5 6 
9 3 2 3 2 
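
As a sketch of how this applies to the vector from the question: na.locf() also accepts a plain vector, and na.rm = FALSE keeps the leading NA so that the result has the same length as the input. The variable names filled and filled_back are only illustrative.

```r
library(zoo)

y <- c(NA, 2, 2, NA, NA, 3, NA, 4, NA, NA)

# na.rm = FALSE keeps the leading NA instead of dropping it
filled <- na.locf(y, na.rm = FALSE)
filled
# [1] NA  2  2  2  2  3  3  4  4  4

# fromLast = TRUE carries the next observation backward instead
filled_back <- na.locf(y, na.rm = FALSE, fromLast = TRUE)
filled_back
# [1]  2  2  2  3  3  3  4  4 NA NA
```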
Excurvature answered 12/10, 2011 at 5:32 Comment(3)
Also note that na.locf in zoo works with ordinary vectors as well as zoo objects. Its na.rm argument can be useful in some applications.Assemble
Use na.locf(cz, na.rm=FALSE) to keep leading NA.Belgian
@Belgian 's comment is important and should be included in the answer. Thanks!Stormystorting
M
77

Sorry for digging up an old question. I couldn't look up the function to do this job on the train, so I wrote one myself.

I was proud to find out that it's a tiny bit faster.
It's less flexible though.

But it plays nice with ave, which is what I needed.

repeat.before = function(x) {   # repeats the last non NA value. Keeps leading NA
    ind = which(!is.na(x))      # get positions of nonmissing values
    if(is.na(x[1]))             # if it begins with a missing, add the 
          ind = c(1,ind)        # first position to the indices
    rep(x[ind], times = diff(   # repeat the values at these indices
       c(ind, length(x) + 1) )) # diffing the indices + length yields how often 
}                               # they need to be repeated

x = c(NA,NA,'a',NA,NA,NA,NA,NA,NA,NA,NA,'b','c','d',NA,NA,NA,NA,NA,'e')  
xx = rep(x, 1000000)  
system.time({ yzoo = na.locf(xx,na.rm=F)})  
## user  system elapsed   
## 2.754   0.667   3.406   
system.time({ yrep = repeat.before(xx)})  
## user  system elapsed   
## 0.597   0.199   0.793   
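
To make the mechanism concrete, here is a small trace of the intermediate values of repeat.before on the vector from the question (a sketch; the variable names are only illustrative):

```r
y <- c(NA, 2, 2, NA, NA, 3, NA, 4, NA, NA)

ind <- which(!is.na(y))                 # positions of the values to carry forward
ind
# [1] 2 3 6 8

ind <- c(1, ind)                        # y starts with NA, so position 1 is kept as is
times <- diff(c(ind, length(y) + 1))    # distance from each kept position to the next
times
# [1] 1 1 3 2 3

filled <- rep(y[ind], times = times)    # repeat each kept value over its run
filled
# [1] NA  2  2  2  2  3  3  4  4  4
```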

Edit

As this became my most upvoted answer, I was reminded often that I don't use my own function, because I often need zoo's maxgap argument. Because zoo has some weird problems in edge cases when I use dplyr + dates that I couldn't debug, I came back to this today to improve my old function.

I benchmarked my improved function and all the other entries here. For the basic set of features, tidyr::fill is fastest while also not failing the edge cases. The Rcpp entry by @BrandonBertelsen is faster still, but it's inflexible regarding the input's type (he tested edge cases incorrectly due to a misunderstanding of all.equal).

If you need maxgap, my function below is faster than zoo (and doesn't have the weird problems with dates).

I put up the documentation of my tests.

new function

repeat_last = function(x, forward = TRUE, maxgap = Inf, na.rm = FALSE) {
    if (!forward) x = rev(x)           # reverse x twice if carrying backward
    ind = which(!is.na(x))             # get positions of nonmissing values
    if (is.na(x[1]) && !na.rm)         # if it begins with NA
        ind = c(1,ind)                 # add first pos
    rep_times = diff(                  # diffing the indices + length yields how often
        c(ind, length(x) + 1) )          # they need to be repeated
    if (maxgap < Inf) {
        exceed = rep_times - 1 > maxgap  # exceeding maxgap
        if (any(exceed)) {               # any exceed?
            ind = sort(c(ind[exceed] + 1, ind))      # add NA in gaps
            rep_times = diff(c(ind, length(x) + 1) ) # diff again
        }
    }
    x = rep(x[ind], times = rep_times) # repeat the values at these indices
    if (!forward) x = rev(x)           # second reversion
    x
}

I've also put the function in my formr package (Github only).

Manifestation answered 10/12, 2012 at 22:45 Comment(0)
M
49

a data.table solution:

dt <- data.table(y = c(NA, 2, 2, NA, NA, 3, NA, 4, NA, NA))
dt[, y_forward_fill := y[1], .(cumsum(!is.na(y)))]
dt
     y y_forward_fill
 1: NA             NA
 2:  2              2
 3:  2              2
 4: NA              2
 5: NA              2
 6:  3              3
 7: NA              3
 8:  4              4
 9: NA              4
10: NA              4
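
To see why grouping by cumsum(!is.na(y)) forward fills: the running count goes up by one at every non-NA value, so each non-NA value and the NAs that follow it share one group id, and y[1] is the first element of that group. A minimal base R sketch of the same idea, using ave() in place of data.table's by (the variable names are only illustrative):

```r
y <- c(NA, 2, 2, NA, NA, 3, NA, 4, NA, NA)

# the counter goes up by one at each non-NA value, so every run of
# "value followed by its NAs" gets its own id; leading NAs get id 0
grp <- cumsum(!is.na(y))
grp
# [1] 0 1 2 2 2 3 3 4 4 4

# within each id, repeat the first element (the non-NA value, or NA for id 0)
y_forward_fill <- ave(y, grp, FUN = function(v) rep(v[1], length(v)))
y_forward_fill
# [1] NA  2  2  2  2  3  3  4  4  4
```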

this approach could work with forward filling zeros as well:

dt <- data.table(y = c(0, 2, -2, 0, 0, 3, 0, -4, 0, 0))
dt[, y_forward_fill := y[1], .(cumsum(y != 0))]
dt
     y y_forward_fill
 1:  0              0
 2:  2              2
 3: -2             -2
 4:  0             -2
 5:  0             -2
 6:  3              3
 7:  0              3
 8: -4             -4
 9:  0             -4
10:  0             -4

this method becomes very useful on data at scale and where you would want to perform a forward fill by group(s), which is trivial with data.table. just add the group(s) to the by clause prior to the cumsum logic.

dt <- data.table(group = sample(c('a', 'b'), 20, replace = TRUE), y = sample(c(1:4, rep(NA, 4)), 20 , replace = TRUE))
dt <- dt[order(group)]
dt[, y_forward_fill := y[1], .(group, cumsum(!is.na(y)))]
dt
    group  y y_forward_fill
 1:     a NA             NA
 2:     a NA             NA
 3:     a NA             NA
 4:     a  2              2
 5:     a NA              2
 6:     a  1              1
 7:     a NA              1
 8:     a  3              3
 9:     a NA              3
10:     a NA              3
11:     a  4              4
12:     a NA              4
13:     a  1              1
14:     a  4              4
15:     a NA              4
16:     a  3              3
17:     b  4              4
18:     b NA              4
19:     b NA              4
20:     b  2              2
Mundane answered 9/8, 2017 at 16:2 Comment(3)
I'm familiar with tidyverse but new to data.table - can I ask you what this does? dt[, y_forward_fill := y[1], .(cumsum(!is.na(y)))] Specifically, y[1] and why .(cumsum(!is.na(y))) forward fills the NAs?Aleenaleetha
HI @TonyDiFranco, how would you suggest someone implement this if the intention is to fill backwards?Hominoid
@JantjeHouten the simplest, though not most efficient, approach would be to reverse the sort order of the data.table, perform a forward fill as indicated, and then reverse once more back to the original orderMundane
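
The reverse, fill, reverse idea from the last comment can be sketched in base R on a plain vector. fill_forward and fill_backward are made-up helper names, and the forward fill is just one of the base R variants from this thread:

```r
# forward fill: keep position 1 and every non-NA position, then index by run
fill_forward <- function(x) {
  keep <- c(TRUE, !is.na(x[-1]))
  x[keep][cumsum(keep)]
}

# backward fill: reverse, forward fill, reverse again
fill_backward <- function(x) rev(fill_forward(rev(x)))

y <- c(NA, 2, 2, NA, NA, 3, NA, 4, NA, NA)
y_backward <- fill_backward(y)
y_backward
# [1]  2  2  2  3  3  3  4  4 NA NA
```

With a data.table, the same effect needs the row order reversed and restored, which is why the comment calls it simple but not the most efficient.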
A
40

The tidyr package (part of the tidyverse suite of packages) has a simple way to do that:

y = c(NA, 2, 2, NA, NA, 3, NA, 4, NA, NA)

# first, transform it into a data.frame

df = as.data.frame(y)
   y
1  NA
2   2
3   2
4  NA
5  NA
6   3
7  NA
8   4
9  NA
10 NA

library(tidyr)
fill(df, y, .direction = 'down')
    y
1  NA
2   2
3   2
4   2
5   2
6   3
7   3
8   4
9   4
10  4
Anarchist answered 7/5, 2020 at 10:36 Comment(3)
The downside of this function is that the atomic vector first has to be converted with as.data.frame() and that the output is also a data.frame instead of an atomic vector.Msg
@Msg It's an upside for my casePongid
This is by far and away the simplest and best answer for this one. Thanks!Billie
D
38

You can use the data.table function nafill, available from data.table >= 1.12.3.

library(data.table)
nafill(y, type = "locf")
# [1] NA  2  2  2  2  3  3  4  4  4

If your vector is a column in a data.table, you can also update it by reference with setnafill:

d <- data.table(x = 1:10, y)
setnafill(d, type = "locf", cols = "y")
d
#      x  y
#  1:  1 NA
#  2:  2  2
#  3:  3  2
#  4:  4  2
#  5:  5  2
#  6:  6  3
#  7:  7  3
#  8:  8  4
#  9:  9  4
# 10: 10  4

If you have NA in several columns...

d <- data.table(x = c(1, NA, 2), y = c(2, 3, NA), z = c(4, NA, 5))
#     x  y  z
# 1:  1  2  4
# 2: NA  3 NA
# 3:  2 NA  5

...you can fill them by reference in one go:

setnafill(d, type = "locf")
d
#    x y z
# 1: 1 2 4
# 2: 1 3 4
# 3: 2 3 5

Note that:

Only double and integer data types are currently [data.table 1.12.6] supported.

The functionality will most likely soon be extended; see the open issue nafill, setnafill for character, factor and other types, where you also find a temporary workaround.
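
Until other types are supported, one possible workaround (a sketch, not necessarily the one from the issue) is to forward fill the integer positions of the non-missing values and then index the original vector with them. The helper name locf_any is made up for this example.

```r
library(data.table)

locf_any <- function(x) {
  idx <- seq_along(x)
  idx[is.na(x)] <- NA_integer_     # blank out the positions that hold NA
  x[nafill(idx, type = "locf")]    # carry the last valid position forward
}

chr_filled <- locf_any(c(NA, "a", NA, NA, "b", NA))
chr_filled
# [1] NA  "a" "a" "a" "b" "b"
```

Indexing with an NA position returns NA, so leading NAs are kept.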

Dilly answered 19/4, 2019 at 16:40 Comment(0)
C
25

Throwing my hat in:

library(Rcpp)
cppFunction('IntegerVector na_locf(IntegerVector x) {
  int n = x.size();

  for(int i = 0; i<n; i++) {
    if((i > 0) && (x[i] == NA_INTEGER) && (x[i-1] != NA_INTEGER)) {
      x[i] = x[i-1];
    }
  }
  return x;
}')

Setup a basic sample and a benchmark:

x <- sample(c(1,2,3,4,NA))

bench_em <- function(x,count = 10) {
  x <- sample(x,count,replace = TRUE)
  print(microbenchmark(
    na_locf(x),
    replace_na_with_last(x),
    na.lomf(x),
    na.locf(x),
    repeat.before(x)
  ), order = "mean", digits = 1)
}

And run some benchmarks:

bench_em(x,1e6)

Unit: microseconds
                    expr   min    lq  mean median    uq   max neval
              na_locf(x)   697   798   821    814   821 1e+03   100
              na.lomf(x)  3511  4137  5002   4214  4330 1e+04   100
 replace_na_with_last(x)  4482  5224  6473   5342  5801 2e+04   100
        repeat.before(x)  4793  5044  6622   5097  5520 1e+04   100
              na.locf(x) 12017 12658 17076  13545 19193 2e+05   100

Just in case:

all.equal(
     na_locf(x),
     replace_na_with_last(x),
     na.lomf(x),
     na.locf(x),
     repeat.before(x)
)
[1] TRUE
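
Note that all.equal() compares only its first two arguments (target and current); anything after that is treated as further options, so the call above checks a single pair. A sketch of a pairwise check against one reference result, using three base R candidates from this thread under made-up short names:

```r
y <- c(NA, 2, 2, NA, NA, 3, NA, 4, NA, NA)

# three base R candidates from this thread (names shortened for the example)
f_rep <- function(x) {
  ind <- which(!is.na(x))
  if (is.na(x[1])) ind <- c(1, ind)
  rep(x[ind], times = diff(c(ind, length(x) + 1)))
}
f_cumsum <- function(x) c(NA, na.omit(x))[cumsum(!is.na(x)) + 1]
f_last   <- function(x, a = !is.na(x)) x[which(a)[c(1, 1:sum(a))][cumsum(a) + 1]]

results   <- list(rep = f_rep(y), cumsum = f_cumsum(y), last = f_last(y))
reference <- results[[1]]

# compare every result with the reference, one pair at a time
agree <- vapply(results, function(r) isTRUE(all.equal(reference, r)), logical(1))
agree
# rep and cumsum agree with the reference; last differs because it
# fills the leading NA with the following value
```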

Update

For a numeric vector, the function is a bit different:

NumericVector na_locf_numeric(NumericVector x) {
  int n = x.size();
  LogicalVector ina = is_na(x);

  for(int i = 1; i<n; i++) {
    if((ina[i] == TRUE) & (ina[i-1] != TRUE)) {
      x[i] = x[i-1];
    }
  }
  return x;
}
Couscous answered 3/9, 2016 at 8:38 Comment(0)
U
23

Dealing with a big data volume, in order to be more efficient, we can use the data.table package.

require(data.table)
replaceNaWithLatest <- function(
  dfIn,
  nameColNa = names(dfIn)[1]
){
  dtTest <- data.table(dfIn)
  setnames(dtTest, nameColNa, "colNa")
  dtTest[, segment := cumsum(!is.na(colNa))]
  dtTest[, colNa := colNa[1], by = "segment"]
  dtTest[, segment := NULL]
  setnames(dtTest, "colNa", nameColNa)
  return(dtTest)
}
Ultramodern answered 3/3, 2015 at 22:7 Comment(2)
An lapply can be added so it can directly apply it to multiple NA columns: replaceNaWithLatest <- function( dfIn, nameColsNa = names(dfIn)[1] ){ dtTest <- data.table(dfIn) invisible(lapply(nameColsNa, function(nameColNa){ setnames(dtTest, nameColNa, "colNa") dtTest[, segment := cumsum(!is.na(colNa))] dtTest[, colNa := colNa[1], by = "segment"] dtTest[, segment := NULL] setnames(dtTest, "colNa", nameColNa) })) return(dtTest) }Lamppost
At first I was excited by this solution, but it's actually not doing the same thing at all. The question is about filling in 1 data set with another. This answer is just imputation.Pumpkinseed
S
20

This has worked for me:

  replace_na_with_last<-function(x,a=!is.na(x)){
     x[which(a)[c(1,1:sum(a))][cumsum(a)+1]]
  }


> replace_na_with_last(c(1,NA,NA,NA,3,4,5,NA,5,5,5,NA,NA,NA))

[1] 1 1 1 1 3 4 5 5 5 5 5 5 5 5

> replace_na_with_last(c(NA,"aa",NA,"ccc",NA))

[1] "aa"  "aa"  "aa"  "ccc" "ccc"

speed is reasonable too:

> system.time(replace_na_with_last(sample(c(1,2,3,NA),1e6,replace=TRUE)))


 user  system elapsed 

 0.072   0.000   0.071 
Schiller answered 25/5, 2016 at 21:2 Comment(3)
This function doesn't do what you expect when there are leading NAs. replace_na_with_last(c(NA,1:4,NA)) (i.e. they're filled with the following value). This is also the default behaviour of imputeTS::na.locf(x, na.remaining = "rev").Manifestation
better to add a default for this case, slightly different approach: replace_na_with_last<-function(x,p=is.na,d=0)c(d,x)[cummax(seq_along(x)*(!p(x)))+1]Schiller
@NickNassuphis 's answer is short, sweet, not package-dependent, and works well with dplyr pipes!Ramify
J
18

Having a leading NA is a bit of a wrinkle, but I find a very readable (and vectorized) way of doing LOCF when the leading term is not missing is:

na.omit(y)[cumsum(!is.na(y))]

A slightly less readable modification works in general:

c(NA, na.omit(y))[cumsum(!is.na(y))+1]

gives the desired output:

c(NA, 2, 2, 2, 2, 3, 3, 4, 4, 4)

Judijudicable answered 29/1, 2018 at 23:0 Comment(0)
H
17

Try this function. It does not require the zoo package:

# last observation moved forward
# replaces all NA values with last non-NA values
na.lomf <- function(x) {

    na.lomf.0 <- function(x) {
        non.na.idx <- which(!is.na(x))
        if (is.na(x[1L])) {
            non.na.idx <- c(1L, non.na.idx)
        }
        rep.int(x[non.na.idx], diff(c(non.na.idx, length(x) + 1L)))
    }

    dim.len <- length(dim(x))

    if (dim.len == 0L) {
        na.lomf.0(x)
    } else {
        apply(x, dim.len, na.lomf.0)
    }
}

Example:

> # vector
> na.lomf(c(1, NA,2, NA, NA))
[1] 1 1 2 2 2
> 
> # matrix
> na.lomf(matrix(c(1, NA, NA, 2, NA, NA), ncol = 2))
     [,1] [,2]
[1,]    1    2
[2,]    1    2
[3,]    1    2
Homework answered 11/5, 2014 at 12:1 Comment(1)
To improve it you can add this: if (!anyNA(x)) return(x).Retrospective
S
9

There are a bunch of packages offering na.locf (NA Last Observation Carried Forward) functions:

  • xts - xts::na.locf
  • zoo - zoo::na.locf
  • imputeTS - imputeTS::na.locf
  • spacetime - spacetime::na.locf

And also other packages where this function is named differently.

Stines answered 11/11, 2016 at 5:51 Comment(0)
M
3

Following up on Brandon Bertelsen's Rcpp contributions. For me, the NumericVector version didn't work: it only replaced the first NA. This is because the ina vector is only evaluated once, at the beginning of the function.

Instead, one can take the exact same approach as for the IntegerVector function. The following worked for me:

library(Rcpp)
cppFunction('NumericVector na_locf_numeric(NumericVector x) {
  R_xlen_t n = x.size();
  for(R_xlen_t i = 0; i<n; i++) {
    if(i > 0 && NumericVector::is_na(x[i]) && !NumericVector::is_na(x[i-1])) {
      x[i] = x[i-1];
    }
  }
  return x;
}')

In case you need a CharacterVector version, the same basic approach also works:

cppFunction('CharacterVector na_locf_character(CharacterVector x) {
  R_xlen_t n = x.size();
  for(R_xlen_t i = 0; i<n; i++) {
    if(i > 0 && x[i] == NA_STRING && x[i-1] != NA_STRING) {
      x[i] = x[i-1];
    }
  }
  return x;
}')
Mana answered 17/2, 2017 at 22:8 Comment(2)
int n = x.size() and for(int i = 0; i<n; i++) should be replaced by double. In R a vector can be larger than the C++ int size.Stines
It looks like this function returns "R_xlen_t". If R is compiled with long vector support, this is defined as ptrdiff_t; if it isn't, it's an int. Thanks for the correction!Mana
A
3

Here is a modification of @AdamO's solution. This one runs faster, because it bypasses the na.omit function. This will overwrite the NA values in vector y (except for leading NAs).

   z  <- !is.na(y)                  # indicates the positions of y whose values we do not want to overwrite
   z  <- z | !cumsum(z)             # for leading NA's in y, z will be TRUE, otherwise it will be FALSE where y has a NA and TRUE where y does not have a NA
   y  <- y[z][cumsum(z)]
Acentric answered 22/3, 2018 at 0:16 Comment(0)
V
3

I want to add another solution, which uses the runner package from CRAN.

library(runner)
y <- c(NA, 2, 2, NA, NA, 3, NA, 4, NA, NA)
fill_run(y, FALSE)
 [1] NA  2  2  2  2  3  3  4  4  4

The whole package is optimized and most of it is written in C++, so it is very efficient.

Virgil answered 23/9, 2020 at 21:45 Comment(0)
S
2

An option in base R, derived from the answers of @Montgomery-Clift and @AdamO, for replacing NAs with the latest non-NA value could be:

y <- c(NA, 2, 2, NA, NA, 3, NA, 4, NA, NA)

i <- c(TRUE, !is.na(y[-1]))
y[i][cumsum(i)]
# [1] NA  2  2  2  2  3  3  4  4  4

When only a few NAs exist, they can be overwritten with the latest non-NA value instead of rebuilding the whole vector.

fillNaR <- function(y) {
  i <- which(is.na(y[-1]))
  j <- which(diff(c(-1L,i)) > 1)
  k <- diff(c(j, length(i) + 1))
  i <- rep(i[j], k)
  `[<-`(y, i + sequence(k), y[i])
}
fillNaR(y)
# [1] NA  2  2  2  2  3  3  4  4  4

When speed is important, a loop propagating the last non-NA value could be written using Rcpp. To be flexible on the input type, this can be done using a template.

Rcpp::sourceCpp(code=r"(
#include <Rcpp.h>
using namespace Rcpp;

template <int RTYPE>
Vector<RTYPE> FNA(const Vector<RTYPE> y) {
  auto x = clone(y);  //or overwrite original
  LogicalVector isNA = is_na(x);
  size_t i = 0;
  while(i < x.size() && isNA[i]) ++i;  // check the bound before reading isNA[i]
  for(++i; i < x.size(); ++i) if(isNA[i]) x[i] = x[i-1];
  return x;
}

// [[Rcpp::export]]
RObject fillNaC(RObject x) {
  RCPP_RETURN_VECTOR(FNA, x);
}
)")
fillNaC(y)
# [1] NA  2  2  2  2  3  3  4  4  4

Those functions can be used inside lapply to apply them on all columns of a data.frame.

DF[] <- lapply(DF, fillNaC)
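
A minimal base R sketch of this column-wise pattern. It assumes a small made-up data.frame and uses a plain R fill function (fill_down, a hypothetical name) in place of fillNaC so that it runs without a compiler:

```r
# forward fill for one column: keep position 1 and every non-NA position
fill_down <- function(x) {
  keep <- c(TRUE, !is.na(x[-1]))
  x[keep][cumsum(keep)]
}

DF <- data.frame(a = c(NA, 1, NA, 2),
                 b = c("x", NA, NA, "y"))

# assigning into DF[] replaces the columns but keeps DF a data.frame
DF[] <- lapply(DF, fill_down)
DF
#    a b
# 1 NA x
# 2  1 x
# 3  1 x
# 4  2 y
```

Each column is filled independently, so columns of different types can be mixed.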

Other answers using Rcpp, specialized for a single data type, look like the following, but they also modify the input vector in place.

y <- c(NA, 2, 2, NA, NA, 3, NA, 4, NA, NA)

Rcpp::cppFunction("NumericVector fillNaCN(NumericVector x) {
  for(auto i = x.begin()+1; i < x.end(); ++i) if(*i != *i) *i = *(i-1);
  return x;
}")

fillNaCN(y)
# [1] NA  2  2  2  2  3  3  4  4  4
y
# [1] NA  2  2  2  2  3  3  4  4  4

Benchmark

fillNaR <- function(y) {
  i <- which(is.na(y[-1]))
  j <- which(diff(c(-1L,i)) > 1)
  k <- diff(c(j, length(i) + 1))
  i <- rep(i[j], k)
  `[<-`(y, i + sequence(k), y[i])
}

Rcpp::sourceCpp(code=r"(
#include <Rcpp.h>
using namespace Rcpp;

template <int RTYPE>
Vector<RTYPE> FNA(const Vector<RTYPE> y) {
  auto x = clone(y);  //or overwrite original
  LogicalVector isNA = is_na(x);
  size_t i = 0;
  while(i < x.size() && isNA[i]) ++i;  // check the bound before reading isNA[i]
  for(++i; i < x.size(); ++i) if(isNA[i]) x[i] = x[i-1];
  return x;
}

// [[Rcpp::export]]
RObject fillNaC(RObject x) {
  RCPP_RETURN_VECTOR(FNA, x);
}
)")

repeat.before <- function(x) {   # @Ruben
    ind = which(!is.na(x))
    if(is.na(x[1])) ind = c(1,ind)
    rep(x[ind], times = diff(c(ind, length(x) + 1) ))
}

RB2 <- function(x) {
  ind = which(c(TRUE, !is.na(x[-1])))
  rep(x[ind], diff(c(ind, length(x) + 1)))
}

MC <- function(y) { # @Montgomery Clift
  z  <- !is.na(y)  
  z  <- z | !cumsum(z)
  y[z][cumsum(z)]
}

MC2 <- function(y) {
  z <- c(TRUE, !is.na(y[-1]))
  y[z][cumsum(z)]
}

fill.NAs <- function(x) { # @Valentas
  is_na <- is.na(x)
  x[Reduce(function(i,j) if (is_na[j]) i else j, seq_len(length(x)), accumulate=T)]}

M <- alist(
fillNaR = fillNaR(y),
fillNaC = fillNaC(y),
repeat.before = repeat.before(y),
RB2 = RB2(y),
MC = MC(y),
MC2 = MC2(y),
fill.NAs = fill.NAs(y),
tidyr = tidyr::fill(data.frame(y), y)$y,
zoo = zoo::na.locf(y, na.rm=FALSE),
data.table = data.table::nafill(y, type = "locf"),
data.table2 = with(data.table::data.table(y)[, y := y[1], .(cumsum(!is.na(y)))], y),
imputeTS = imputeTS::na_locf(y, na_remaining = "keep"),
runner = runner::fill_run(y, FALSE),
vctrs = vctrs::vec_fill_missing(y, direction = "down"),
ave = ave(y, cumsum(!is.na(y)), FUN = \(x) x[1])
)

Result

n <- 1e5
set.seed(42); y <- rnorm(n); is.na(y) <- sample(seq_along(y), n/100)
bench::mark(exprs = M)  #1% NA
#   expression         min   median `itr/sec` mem_alloc `gc/sec` n_itr  n_gc
#   <bch:expr>    <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl> <int> <dbl>
# 1 fillNaR       399.82µs   1.02ms    459.      3.56MB    31.9    230    16
# 2 fillNaC       672.85µs 883.74µs    976.      1.15MB    22.0    488    11
# 3 repeat.before   1.28ms    2.8ms    290.      7.57MB    58.0    145    29
# 4 RB2             1.93ms   3.66ms    229.      9.86MB    57.7    115    29
# 5 MC              1.01ms   1.98ms    289.      5.33MB    37.9    145    19
# 6 MC2            884.6µs   1.96ms    393.      6.09MB    53.5    198    27
# 7 fill.NAs       89.37ms   93.1ms     10.1     4.58MB    13.5      6     8
# 8 tidyr           8.42ms   11.3ms     86.3     1.55MB     5.89    44     3
# 9 zoo             1.83ms   3.19ms    216.      7.96MB    31.9    108    16
#10 data.table     73.91µs 259.71µs   2420.    797.38KB    36.0   1210    18
#11 data.table2    54.54ms  58.71ms     16.9     3.47MB     3.75     9     2
#12 imputeTS      623.69µs   1.07ms    494.      2.69MB    30.0    247    15
#13 runner          1.36ms   1.58ms    586.    783.79KB    10.0    293     5
#14 vctrs         149.98µs 317.14µs   1725.      1.53MB    54.0    863    27
#15 ave           137.87ms 149.25ms      6.53   14.77MB     8.17     4     5

set.seed(42); y <- rnorm(n); is.na(y) <- sample(seq_along(y), n/2)
bench::mark(exprs = M)  #50% NA
#  expression         min   median `itr/sec` mem_alloc `gc/sec` n_itr  n_gc
#   <bch:expr>    <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl> <int> <dbl>
# 1 fillNaR         2.15ms   3.13ms    217.      7.92MB    59.7    109    30
# 2 fillNaC       949.22µs   1.09ms    728.      1.15MB    28.0    364    14
# 3 repeat.before   1.36ms   1.89ms    287.      4.77MB    49.6    185    32
# 4 RB2             1.64ms   2.44ms    347.      7.06MB    39.9    174    20
# 5 MC              1.48ms   1.92ms    443.      4.77MB    34.0    222    17
# 6 MC2             1.09ms   1.72ms    479.      5.53MB    45.9    240    23
# 7 fill.NAs       93.17ms 104.28ms      9.58    4.58MB     9.58     5     5
# 8 tidyr           7.09ms  10.07ms     96.7     1.55MB     3.95    49     2
# 9 zoo             1.62ms   2.28ms    344.      5.53MB    29.8    173    15
#10 data.table    389.69µs 484.81µs   1225.    797.38KB    14.0    613     7
#11 data.table2    27.46ms  29.32ms     33.4      3.1MB     3.93    17     2
#12 imputeTS        1.71ms    2.1ms    413.      3.44MB    25.9    207    13
#13 runner          1.62ms   1.75ms    535.    783.79KB     7.98   268     4
#14 vctrs         144.92µs 293.44µs   2045.      1.53MB    48.0   1023    24
#15 ave            66.38ms  71.61ms     14.0    10.78MB    10.5      8     6

Depending on the share of NAs to be filled, either data.table::nafill or vctrs::vec_fill_missing is the fastest.

Sceptic answered 30/5, 2022 at 8:24 Comment(0)
O
1
fill.NAs <- function(x) {is_na<-is.na(x); x[Reduce(function(i,j) if (is_na[j]) i else j, seq_len(length(x)), accumulate=T)]}

fill.NAs(c(NA, 2, 2, NA, NA, 3, NA, 4, NA, NA))

[1] NA  2  2  2  2  3  3  4  4  4

Reduce is a nice functional programming concept that may be useful for similar tasks. Unfortunately in R it is ~70 times slower than repeat.before in the above answer.

Onassis answered 24/4, 2019 at 13:3 Comment(0)
A
1

I personally use this function. I do not know how fast or slow it is. But it does its job without having to use libraries.

replace_na_with_previous <- function(vector) {
        # a leading NA has no previous value, so it takes the first non-NA value
        if (is.na(vector[1]))
            vector[1] <- na.omit(vector)[1]
        # start at 2 so that vector[i - 1] always exists
        for (i in seq_along(vector)[-1]) {
            if (is.na(vector[i]))
                vector[i] <- vector[i - 1]
        }
        return(vector)
    }

If you want to apply this function to every column of a data frame called df, simply run:

df[]<-lapply(df,replace_na_with_previous)
Askwith answered 21/2, 2020 at 11:36 Comment(0)
S
1

I'm posting this here as this might be helpful for others with problems similar to the asked question.

The most recent tidyverse solution, using the vctrs package, can be combined with mutate to create a new column:

library(dplyr)
library(magrittr)
library(vctrs)

as.data.frame(y) %>%
  mutate(y_filled = vec_fill_missing(y, direction = c("down")) )

Returns

   y  y_filled
1  NA       NA
2   2        2
3   2        2
4  NA        2
5  NA        2
6   3        3
7  NA        3
8   4        4
9  NA        4
10 NA        4

While changing the 'filling direction' to 'up' results in:

    y  y_filled
1  NA        2
2   2        2
3   2        2
4  NA        3
5  NA        3
6   3        3
7  NA        4
8   4        4
9  NA       NA
10 NA       NA

Might wanna also try "downup" or "updown"

Please note that this function is still in the experimental lifecycle stage, so the syntax might change.

Showiness answered 23/11, 2021 at 14:2 Comment(0)
E
1

You can use my function roll_na_fill() which is optimised for data consisting of many groups.

Example benchmark

# remotes::install_github("NicChr/timeplyr")

library(timeplyr)
library(vctrs)
library(data.table)
library(zoo)
library(imputeTS)
library(ggplot2)
library(microbenchmark)

x <- sample.int(10^2, 10^5, TRUE)
x[sample.int(10^5, round(10^5/3))] <- NA
groups <- sample.int(10^3, 10^5, TRUE)

dt <- data.table(x, groups)

### No groups

m1  <- microbenchmark(data.table = dt[, filled1 := data.table::nafill(x, type = "locf")][]$filled1,
                      vctrs = dt[, filled3 := vctrs::vec_fill_missing(x)][]$filled3,
                      zoo = dt[, filled4 := zoo::na.locf0(x)][]$filled4,
                      timeplyr = dt[, filled5 := .roll_na_fill(x)][]$filled5,
                      imputeTS = dt[, filled6 := na_locf(x, na_remaining = "keep")][]$filled6,
                      times = 20)
autoplot(m1)

### With groups

m2 <- microbenchmark(data.table = dt[, filled1 := data.table::nafill(x, type = "locf"),
                            by = groups][]$filled1,
            vctrs = dt[, filled3 := vctrs::vec_fill_missing(x), by = groups][]$filled3,
            zoo = dt[, filled4 := zoo::na.locf0(x), by = groups][]$filled4,
            timeplyr1 = dt[, filled5 := .roll_na_fill(x), by = groups][]$filled5,
            timeplyr2 = dt[, filled7 := roll_na_fill(x, g = groups)][]$filled7,
            imputeTS = dt[, filled6 := na_locf(x, na_remaining = "keep"), 
                          by = groups][]$filled6,
            times = 20)
autoplot(m2)

Created on 2023-11-12 with reprex v2.0.2

Eupatrid answered 12/11, 2023 at 17:20 Comment(0)
F
0

I tried the below:

nullIdx <- as.array(which(is.na(masterData$RequiredColumn)))
masterData$RequiredColumn[nullIdx] = masterData$RequiredColumn[nullIdx-1]

nullIdx holds the indices at which masterData$RequiredColumn is NA. The next line replaces each of those entries with the value at index - 1, i.e. the value immediately before each NA.
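
A small sketch of what this one-step lookup does on a run of consecutive NAs, next to a variant that looks up the most recent non-NA position with cummax(). The vector x and the names one_step, last_seen and filled are made up for the example:

```r
x <- c(1, NA, NA, 4, NA)

# one-step lookup: only the first NA of each run gets a value
idx <- which(is.na(x))
one_step <- x
one_step[idx] <- x[idx - 1]
one_step
# [1]  1  1 NA  4  4

# position of the most recent non-NA value for every element (0 if none yet)
last_seen <- cummax(seq_along(x) * !is.na(x))
# prepend NA so that position 0 maps to NA
filled <- c(NA, x)[last_seen + 1]
filled
# [1] 1 1 1 4 4
```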

Fourscore answered 27/10, 2016 at 16:3 Comment(1)
This doesn't work if there are multiple consecutive missing values - 1 NA NA turns into 1 1 NA. Also, I think the as.array() is unnecessary.Blocker
L
0

This worked for me, although I'm not sure whether it is more efficient than other suggestions.

rollForward <- function(x){
  curr <- NA  # nothing to carry forward yet, so leading NAs stay NA
  for (i in seq_along(x)){
    if (is.na(x[i])){
      x[i] <- curr
    }
    else{
      curr <- x[i]
    }
  }
  return(x)
}
Lewd answered 13/1, 2017 at 19:16 Comment(0)
S
0

Late to the party, but here is a very concise and extensible answer for use with library(data.table), and therefore usable as dt[, SomeVariable := FunctionBelow, by = list(group)].

library(imputeTS)
y <- c(NA, 2, 2, NA, NA, 3, NA, 4, NA, NA)
y
[1] NA  2  2 NA NA  3 NA  4 NA NA
imputeTS::na_locf(imputeTS::na_locf(y,option = "nocb"),option="locf")
[1] 2 2 2 3 3 3 4 4 4 4
Southerland answered 8/4, 2022 at 17:43 Comment(0)
