add exact proportion of random missing values to data.frame
I would like to add random NA to a data.frame in R. So far I've looked into these questions:

R: Randomly insert NAs into dataframe proportionaly

How do I add random NAs into a data frame

add random missing values to a complete data frame (in R)

Many solutions were provided there, but I couldn't find one that complies with all 5 of these conditions:

  • Add truly random NA, not the same amount per row or per column.
  • Work with every class of variable that one can encounter in a data.frame (numeric, character, factor, logical, ts...), so the output must have the same format as the input data.frame or matrix.
  • Guarantee an exact number or proportion [note] of NA in the output (many solutions produce fewer NA than requested because several are generated at the same place).
  • Be computationally efficient for big datasets.
  • Add the proportion/number of NA independently of the NA already present in the input.

Does anyone have an idea? I have already tried to write a function to do this (in an answer to the first linked question), but it doesn't comply with conditions 3 and 4. Thanks.

[note] The exact proportion, rounded to within +/- 1 NA of course.

Nigrescent answered 15/9, 2016 at 14:34 Comment(8)
Can you elaborate on how this answer is not enough for you? (At the very least, checking the proportion of NA and doing another pass for the missing percentage should also work.) - Sean
@Sean yes, thanks, that's what I mean: I would like to output the right proportion/number of NA directly. If you can modify your suggestion to comply with this, I would be glad. - Nigrescent
I can't, it's an existing answer, and I don't see how to elaborate more on it. I don't get the need for a precise % (which in itself is roughly nonsense). Getting the proportion of NA is easy (sum(is.na(df)) / (nrow(df) * ncol(df))), as is checking whether it's in an acceptable range and, if not, adding NA again. - Sean
@Sean well, that's precisely why I asked a separate question, which you are welcome to answer ;-) - Nigrescent
Can't help but notice agenis put up an answer to the question you mentioned. Did your own answer to that not work on larger datasets? It appears you had a way of proportionally adding NAs from that. - Illconsidered
@Illconsidered indeed, some months ago, yes, but that code of mine doesn't guarantee the exact amount of NA, which I need now :-) (and it is really not efficient for big datasets, btw) - Nigrescent
As Tensibai noted, when sampling for "true random", guaranteeing a value would be counterintuitive: it steps into pseudo-randomness, since you are forcing a specific outcome. Just put in a checker as they mentioned and run and re-run until you are satisfied. - Illconsidered
I'm sorry if you don't understand the reason for my question. I do a lot of model comparisons that also have to impute the data, and this is really what I need. But I understand what you say: I could also design a simulation where the number of NA is itself random, but this is not what I'm currently running. Maybe I'll study that another time :-) - Nigrescent

This is the way that I do it for my paper on library(imputeMulti), which is currently in review at JSS. This inserts NAs into a random percentage of the whole dataset and scales well. It doesn't guarantee an exact number when (n * p * pctNA) %% 1 != 0, because the count is rounded down with floor().

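createNAs <- function (x, pctNA = 0.1) {
  n <- nrow(x)
  p <- ncol(x)
  # One logical flag per cell; TRUE marks a cell that will become NA
  NAloc <- rep(FALSE, n * p)
  # Draw floor(n * p * pctNA) distinct cells over the whole table
  NAloc[sample.int(n * p, floor(n * p * pctNA))] <- TRUE
  # A logical matrix index works for both matrices and data.frames
  x[matrix(NAloc, nrow = n, ncol = p)] <- NA
  return(x)
}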

Obviously you should set a random seed for reproducibility, which can be done before the function call.
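As a rough illustration of the seeding advice, the sketch below repeats the createNAs function from this answer so that it runs on its own, then calls it twice with the same seed. The toy data frame df and the seed value 42 are assumptions made for the example only.

```r
# createNAs as defined above, repeated so the example is self-contained
createNAs <- function(x, pctNA = 0.1) {
  n <- nrow(x)
  p <- ncol(x)
  NAloc <- rep(FALSE, n * p)
  NAloc[sample.int(n * p, floor(n * p * pctNA))] <- TRUE
  x[matrix(NAloc, nrow = n, ncol = p)] <- NA
  return(x)
}

# Hypothetical toy data: 10 rows x 4 columns = 40 cells, mixed classes
df <- data.frame(
  num = seq(0.5, 5, by = 0.5),
  chr = letters[1:10],
  fac = factor(rep(c("a", "b"), 5)),
  lgl = rep(c(TRUE, FALSE), 5),
  stringsAsFactors = FALSE
)

set.seed(42)
out1 <- createNAs(df, pctNA = 0.25)
set.seed(42)
out2 <- createNAs(df, pctNA = 0.25)

identical(out1, out2)   # same seed, same NA positions
sum(is.na(out1))        # floor(40 * 0.25) = 10 cells set to NA
sapply(out1, class)     # column classes are unchanged
```

Resetting the seed before each call is what makes the two results identical; a single set.seed() at the top of a script is enough when the calls always run in the same order.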

This works as a general strategy for creating baseline datasets for comparison across imputation methods. I believe this is what you want, although your question (as noted in the comments) isn't clearly stated.

Edit: I do assume that x is complete, so I'm not sure how it would handle existing missing data. You could certainly modify the code if you want, though that would probably increase the runtime by at least O(n*p).
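One possible modification along those lines, as a minimal sketch rather than a tested extension of the function above: scan for the cells that are not yet missing and sample only among them, so that new NA never land on existing ones. The name addNAsToObserved and the toy data are hypothetical; the is.na() scan is the extra O(n*p) step.

```r
# Hypothetical helper (not from the answer): add NA only to cells that are
# currently non-missing, so new NA never overlap existing ones.
addNAsToObserved <- function(x, pctNA = 0.1) {
  n <- nrow(x)
  p <- ncol(x)
  observed <- which(!is.na(x))   # linear indices of non-missing cells
  nToAdd <- min(floor(n * p * pctNA), length(observed))
  # Index into `observed` through sample.int() to avoid the special
  # behaviour of sample() when it is given a single number
  chosen <- observed[sample.int(length(observed), nToAdd)]
  NAloc <- matrix(FALSE, nrow = n, ncol = p)
  NAloc[chosen] <- TRUE
  x[NAloc] <- NA
  x
}

# Toy data: 20 cells, 3 of them already missing
df <- data.frame(a = c(NA, 2:10), b = c(letters[1:8], NA, NA),
                 stringsAsFactors = FALSE)
set.seed(1)
out <- addNAsToObserved(df, pctNA = 0.25)
sum(is.na(out))   # 3 existing + floor(20 * 0.25) = 8
```

Here pctNA is taken as a share of all cells, added on top of whatever is already missing; other conventions are possible.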

Necaise answered 15/9, 2016 at 17:31 Comment(2)
Thanks. That's exactly what I need. I edited my question to be clearer and to mention the rounding in the case of a proportion. - Nigrescent
Hi @Alex W, I took the liberty of adding an answer that expands on your function to handle the case where NA are already present in the data frame. You can take a look. - Nigrescent

Some users reported that Alex's answer did not address condition 5 of my question. Indeed, when adding random NA to a data frame that already contains missing values, some of the new ones will fall on the existing ones, and the final proportion will end up somewhere between the initial proportion and the desired one. So I expanded on Alex's function to comply with all 5 conditions:

I modified his createNAs function so that it supports one of 3 options:

  • option complement: complement with NA up to the desired %
  • option add: add the % of NA in addition to those already present
  • option none: add a % of NA regardless of those already present
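To make the difference between the three options concrete, here is a small sketch of the NA count each one aims for. The helper targetNAcount and its arguments are hypothetical and only do the arithmetic; for option none the value is an average, since the number of draws landing on cells that are already NA varies from run to run.

```r
# Hypothetical helper: final NA count each option aims for, given the total
# number of cells, the number already missing, and the requested proportion.
targetNAcount <- function(nCells, nExisting, pctNA,
                          option = c("complement", "add", "none")) {
  option <- match.arg(option)
  k <- floor(nCells * pctNA)   # number of cells corresponding to pctNA
  switch(option,
    complement = max(nExisting, k),
    add        = nExisting + k,
    # "none": k cells are drawn among all cells, so on average only the
    # share that falls on non-missing cells creates new NA
    none       = nExisting + k * (nCells - nExisting) / nCells
  )
}

# 100 cells, 10 already missing, 25% requested
targetNAcount(100, 10, 0.25, "complement")   # 25
targetNAcount(100, 10, 0.25, "add")          # 35
targetNAcount(100, 10, 0.25, "none")         # 32.5 on average
```

In this example a single run of option none ends with anywhere between 25 and 35 NA, which is the "somewhere between" behaviour that motivated the other two options.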

For options 1 and 2 (complement and add), the function calls itself repeatedly until the desired proportion of NA is reached:

createNAs <- function (x, pctNA = 0.0, option = "add") {
  # Proportion of missing cells in a data.frame or matrix
  prop.NA <- function(x) sum(is.na(x)) / prod(dim(x))
  initial.pctNA <- prop.NA(x)

  if (option == "complement" && initial.pctNA > pctNA) {
    message("The data already had more NA than the target percentage. Returning original data")
    return(x)
  }

  if (option == "none" || initial.pctNA == 0) {
    # Draw cells regardless of existing NA (Alex's original function)
    n <- nrow(x)
    p <- ncol(x)
    NAloc <- rep(FALSE, n * p)
    NAloc[sample.int(n * p, floor(n * p * pctNA))] <- TRUE
    x[matrix(NAloc, nrow = n, ncol = p)] <- NA
    return(x)
  } else { # any option other than "none"
    target <- if (option == "complement") pctNA else pctNA + initial.pctNA
    while (prop.NA(x) < target) {
      prop.remaining.to.add <- target - prop.NA(x)
      # Stop once less than one whole cell remains: floor() would then
      # add nothing and the loop would never end
      if (floor(prod(dim(x)) * prop.remaining.to.add) < 1) break
      x <- createNAs(x, prop.remaining.to.add, option = "none")
    }
    return(x)
  }
}
Nigrescent answered 6/8, 2019 at 9:59 Comment(0)
