generate random sequences of NA of random lengths in a vector
Asked Answered
S

7

10

I want to generate missing values in a vector so that the missing value are grouped in sequences, to simulate periods of missing data of different length.

Let's say I have a vector of 10 000 values and I want to generate 12 sequences of NA at random locations in the vector, each sequence having a random length L between 1 and 144 (144 simulates 2 days of missing values at timestep 10 minutes). The sequences must not overlap.

How can I do that? Thanks.

I tried combining lapply and seq without success.

An example expected output with 3 distinct sequences:

# 1 2 3 5 2 NA NA 5 4 6 8 9 10 11 NA NA NA NA NA NA 5 2 NA NA NA...

EDIT

I'm dealing with a seasonal time series so the NA must overwrite values and not be inserted as new elements.

Sandusky answered 16/6, 2017 at 13:23 Comment(0)
S
6

If both the starting position and the run-length of each NA-sequence is supposed to be random I think you cannot be sure to immediately find a fitting solution, since your constraint is that the sequences must not overlap.

Therefore I propose the following solution which tries up to a limited number of times (max_iter) to find a fitting combination of starting positions and NA-run-lengths. If one is found, it is returned, if none is found within the defined maximum number of iterations, you'll just get a notice returned.

x = 1:1000
n = 3
m = 1:144

f <- function(x, n, m, max_iter = 100) {
  i = 0
  repeat {
    i = i+1
    idx <- sort(sample(seq_along(x), n))        # starting positions
    dist <- diff(c(idx, length(x)))             # check distance inbetween 
    na_len <- sample(m, n, replace = TRUE) - 1L # lengths of NA-runs
    ok <- all(na_len < dist)                    # check overlap
    if(ok | i == max_iter) break 
  }

  if(ok) {
    replace(x, unlist(Map(":", idx, idx+na_len)), NA)
  } else {
      cat("no solution found in", max_iter, "iterations")
    }
}

f(x, n, m, max_iter = 20)

Of course you can increase the number of iterations easily and you should note that with larger n it's increasingly difficult (more iterations required) to find a solution.

Sielen answered 16/6, 2017 at 14:11 Comment(2)
nice, thanks for the function, it seems that it meets my criteriaSandusky
@Sandusky a minor correction: I think you need to subtract 1 from each na_len otherwise the NA runs may vary between 2 and 145 instead of 1 and 144.Sielen
S
6

All other answers more or less follow a "conditional specification" where starting index and run length of the NA chunks are simulated. However, as non-overlapping condition must be satisfied these chunks have to be determined one by one. Such dependence prohibits vectorization, and either for loop or lapply / sapply must be used.

However, this problem is just another run length problem. 12 non-overlapping NA chunks would divide the whole sequence into 13 non-missing chunks (yep, I guess this is what OP wants as missing chunks occurring as the first chunk or the last chunk is not interesting). So why not think of the following:

  • generate run length of 12 missing chunks;
  • generate run length of 13 non-missing chunks;
  • interleave these two type of chunks.

The second step looks difficult as it must satisfy that length of all chunks sums up to a fixed number. Well, multinomial distribution is just for this.

So here is a fully vectorized solution:

# run length of 12 missing chunks, with feasible length between 1 and 144
k <- sample.int(144, 12, TRUE)

# run length of 13 non-missing chunks, summing up to `10000 - sum(k)`
# equal probability is used as an example, you may try something else
m <- c(rmultinom(1, 10000 - sum(k), prob = rep.int(1, 13)))

# interleave `m` and `k`
n <- c(rbind(m[1:12], k), m[13])

# reference value: 1 for non-missing and NA for missing, and interleave them
ref <- c(rep.int(c(1, NA), 12), 1)

# an initial vector
vec <- rep.int(ref, n)

# missing index
miss <- is.na(vec)

We can verify that sum(n) is 10000. What's next? Feel free to fill in non-missing entries with random integers maybe?


My initial answer may be too short to follow, thus the above expansion is taken.

It is straightforward to write a function implementing the above, with user input, in place of example parameter values 12, 144, 10000.

Note, the only potential problem of multinomial, is that under some bad prob, it could generate some zeros. Thus, some NA chunks will in fact join together. To get around this, a robust check is as such: replace all 0 to 1, and subtract the inflation of such change from the max(m).

Seka answered 16/6, 2017 at 13:54 Comment(0)
S
6

If both the starting position and the run-length of each NA-sequence is supposed to be random I think you cannot be sure to immediately find a fitting solution, since your constraint is that the sequences must not overlap.

Therefore I propose the following solution which tries up to a limited number of times (max_iter) to find a fitting combination of starting positions and NA-run-lengths. If one is found, it is returned, if none is found within the defined maximum number of iterations, you'll just get a notice returned.

x = 1:1000
n = 3
m = 1:144

f <- function(x, n, m, max_iter = 100) {
  i = 0
  repeat {
    i = i+1
    idx <- sort(sample(seq_along(x), n))        # starting positions
    dist <- diff(c(idx, length(x)))             # check distance inbetween 
    na_len <- sample(m, n, replace = TRUE) - 1L # lengths of NA-runs
    ok <- all(na_len < dist)                    # check overlap
    if(ok | i == max_iter) break 
  }

  if(ok) {
    replace(x, unlist(Map(":", idx, idx+na_len)), NA)
  } else {
      cat("no solution found in", max_iter, "iterations")
    }
}

f(x, n, m, max_iter = 20)

Of course you can increase the number of iterations easily and you should note that with larger n it's increasingly difficult (more iterations required) to find a solution.

Sielen answered 16/6, 2017 at 14:11 Comment(2)
nice, thanks for the function, it seems that it meets my criteriaSandusky
@Sandusky a minor correction: I think you need to subtract 1 from each na_len otherwise the NA runs may vary between 2 and 145 instead of 1 and 144.Sielen
B
5

EDIT: Just for fun here's a shorter recursive version of my solution below

add_nas <- function(v,n_seq = 12,min_l_seq = 1,max_l_seq = 144){
  insert_length  <- sample(min_l_seq:max_l_seq,1)
  insert_pos     <- sample(length(v)-insert_length,1)
  v <- v[-(insert_pos+(1:insert_length)-1)]
  if(n_seq > 1){v <- add_nas(v,n_seq-1,min_l_seq,max_l_seq)}
  append(v,rep(NA,insert_length),insert_pos-1)
}

Old answer:

# we build a vextor of 20 values
v <- sample(1:100,20,replace=TRUE) # your vector
# your parameters
n_seq <- 3     # you put 12 here
min_l_seq <- 1 #
max_l_seq <- 5 # you put 144 here

# first we will delete items, then we add NAs where we deleted instead
insert_lengths <- sample(min_l_seq:max_l_seq,n_seq,replace=TRUE)
lengths_before_deletion <- length(v)- c(0,insert_lengths[-length(insert_lengths)])
insert_pos <- sapply(lengths_before_deletion-insert_lengths+1,function(x){sample(1:x,1)})

v2 <- v
print(v)
for (i in 1:n_seq){
  v2 <- v2[-(insert_pos[i]:(insert_pos[i]+insert_lengths[i]-1))]
  print(v2)
}

for (i in n_seq:1){
  v2 <- c(v2[1:(insert_pos[i]-1)],rep(NA,insert_lengths[i]),v2[insert_pos[i]:length(v2)])
  print(v2)
}

here's the log

> print(v)
 [1] 75 11  4 19 55 20 65 48 85 20 61 16 75 31 50 10 30 61  4 32
> for (i in 1:n_seq){
+   v2 <- v2[-(insert_pos[i]:(insert_pos[i]+insert_lengths[i]-1))]
+   print(v2)
+ }
 [1] 75 11 55 20 65 48 85 20 61 16 75 31 50 10 30 61  4 32
 [1] 75 11 55 20 65 48 85 20 61 16 75 50 10 30 61  4 32
 [1] 75 11 55 20 65 48 85 20 61 16 75 50 10 30 32
> 
> for (i in n_seq:1){
+   v2 <- c(v2[1:(insert_pos[i]-1)],rep(NA,insert_lengths[i]),v2[insert_pos[i]:length(v2)])
+   print(v2)
+ }
 [1] 75 11 55 20 65 48 85 20 61 16 75 50 10 30 NA NA 32
 [1] 75 11 55 20 65 48 85 20 61 16 75 NA 50 10 30 NA NA 32
 [1] 75 11 NA NA 55 20 65 48 85 20 61 16 75 NA 50 10 30 NA NA 32
Beliabelial answered 16/6, 2017 at 13:58 Comment(2)
NOTE: as requested sequences of NA will not overlap, but they may come one right after another, though that's unlikely in your datasetBeliabelial
The solution is not completely random, the consecutive removals may remove on both side of previous cuts, thus creating artificially more cases of sequences of 288 (still rare though!). It may still work for you though, every output fits your requirements, it's just that some possibilities will have an extra probability of appearing.Beliabelial
B
3

Here is my revised version:

while(1){
  na_span_vec <- sample((10000-143), 12) %>% sort 
  if(min(na_span_vec - lag(na_span_vec), na.rm = T) > 144) break
}
na_idx <- na_span_vec %>% as.list %>% 
  lapply(function(x) seq(x, x + sample(143, 1))) %>% unlist
original_vec[na_idx] <- NA
Baccy answered 16/6, 2017 at 13:43 Comment(4)
When I run your code, it seems to generate sequences longer than 144.Sideways
cool tricks in the code... I didn't think of sampling both start and end points. However with this method I get a vector with 50% of NA in average. Unfortunately it's way too much for my algorithm, that's why I wanted to limit the length to 144Sandusky
oops I misunderstand what you say. I will reviseBaccy
Add this bit ? insert_data <- data.frame(size = sample(1:144, 12), start = sample(seq(1,10000, by = 144),12)) insert_data$end <- insert_data$start + insert_data$size na_span_matrix <- insert_data[,2:3] %>% tHuppah
H
3

You could use this function:

genVecLength<-function(vec,namin,namax,nanumber) {
    nalengths<-sample(namin:namax,nanumber,replace=TRUE)
    vec[sort(sample(nanumber*2+1,length(vec),replace=TRUE))%%2==0]<-NA
    vec
}

where vec is your original vector, namin and namax are the minimum and maximum length of the NA sequence and nanumber is the number of sequences.

An example:

set.seed(1)
genVecLength(1:30,namin=1,namax=5,nanumber=3)
#[1]  1  2  3 NA NA NA NA NA  9 10 11 12 13 NA NA NA 17 18 19 20 21 NA NA NA 25
#[26] 26 27 28 29 30

For your example, if vec<-runif(10000), you could try:

genVecLength(vec,1,144,12)
Hexad answered 16/6, 2017 at 13:56 Comment(2)
as I understood OP wanted to simulate gaps from complete data, not inserting NAs like you didBeliabelial
You might be right. I'm gonna delete this answer and see if I can find a fix.Hexad
L
2

Here is a simple idea. Randomly cut non-na part into 13 piece (some piece may have 0 length, it's all right since we can reserve one non-na position at the end for each 11 NA sequence), and insert generated 12 NA sequence between them. So 12 NA seq without overlap in a length 10000 vector means there are 10000 - sum(length(NA.seq)) - 11 non-na position (11 is the reserved non-na position at the end of 11 NA sequence.

orig.seq = 1:10000
na.len = sapply(1:12, function(x) sample(1:144, 1)) # na sequence length
na.len[1:11] = na.len[1:11] + 1 #reserve one non-na position for first 11 NA seq
avail.space = 10000 - sum(na.len) # number of non-na position to cut (sum(na.len) includes the reserved one non-na position)
avail.space.loc = sample(0:avail.space, 12) %>% sort # find 12 cut point to split it into 13 piece
end = avail.space.loc + cumsum(na.len)
start = end - na.len
for (i in 1:12) {
    if (i != 12) {
        orig.seq[start[i]:end[i]-1] <- NA # recover the reserved non-na position
    } else orig.seq[start[i]:end[i]] <- NA
}
Ladonnalady answered 16/6, 2017 at 14:21 Comment(1)
nice idea! seems it meets all the critariaSandusky
R
1
 #just a vector of 10000 values (uniform distribution)
 initVec <- runif(10000)

 #12 sequences of NA's with length 1:144 (randomly picked)
 naVecList<-lapply(sample(c(1:144),12,replace = T),function(x) rep(NA,x))

 #random positions (along the whole length of initVec)
 (randomPositions<-sort(unlist(lapply(seq_along(1:length(naVecList)), function(x) sample(c(1:(length(initVec)-144)),x,replace = T)[1]))))#added safenet


 #insert the NA elements at random places.
  for(i in 1:length(randomPositions))
    initVec[randomPositions[i]:(randomPositions[i]+length(naVecList[[i]]))]<-naVecList[[i]]
Rodriguez answered 16/6, 2017 at 15:49 Comment(1)
@Sandusky now there are better distributedRodriguez

© 2022 - 2024 — McMap. All rights reserved.