Is there a faster way to find the first value that is not NA in a large vector using base R? [closed]
Asked Answered
D

5

5

Just like the question says. Is there a faster way to do what is done below when the vector size is very large (> 10M entries) using base R?

The code below works, but when the vector size grows it becomes slow for what should be obvious reasons. In this particular example a for loop might be faster but if the first NA value is very far from the start of the vector maybe not...

set.seed(1)
x <- c(rep(NA, 3), sample(c(T,F), size=5e7, replace=T))
min(which(!is.na(x))) #4
Disreputable answered 25/10 at 13:16 Comment(4)
How fast is acceptably "faster"? How slow, exactly, does it become and at what vector size? This question is lacking objective metrics by which to answer it satisfactorily.Hanley
I and several others who have answered the question think the point is clear. The code above is very inefficient and does not short-circuit when it actually has two opportunities to do so. The excellent answer by Sam_R quantifies exactly what you've asked.Disreputable
I didn't comment about the point; it's clear what you want, but what's not clear (because they are completely absent) are the criteria here. Surely as the question author you can easily edit the question to include information about the specific point where the vector size growth becomes a problem, and list how fast would be acceptable (even just at an extremely high level like O(n) complexity, if that applies). See minimal reproducible example for more information on what kind of information is needed when you're asking help with solving a problem.Hanley
See e.g. What makes a "good" performance question on SO?Hanley
G
8

An Rcpp for loop should be fast even if the first non-NA value is not particularly near the start. The basic idea is that you do not need to keep iterating after you have found the first value.

Rcpp::cppFunction("
int which_non_na(LogicalVector x) {
    R_xlen_t n = x.size();
    for (R_xlen_t i = 0; i < n; ++i) {
        if (!LogicalVector::is_na(x[i])) {
            return i + 1; // +1 for 1-indexing in R
        }
    }
    return NA_INTEGER; // if no non-NA values are found
}
")

which_non_na(x) # 4

Benchmarks

Here are some benchmarks for some vectors of length 1e3 to 1e8. For a vector of length 1,000 this is about 9 times faster than the R approach. For a vector of length 100,000,000, it is on average over a million times faster. I've added the other answers into the benchmark, too. which.min() is faster than min(which()) but for the largest vector here this is still about 200,000 times faster than that.

bench::press(
    vec_length = 10^(3:8),
    {
        # set probability of NA (so it won't necessarily be early)
        prob_na <- min(0.9, log10(vec_length) / 10)
        probs <- c(prob_na, (1 - prob_na) / 2, (1 - prob_na) / 2)

        set.seed(1)
        x <- sample(c(NA, TRUE, FALSE), size = vec_length, replace = TRUE, prob = probs)

        bench::mark(
            relative = TRUE,
            base_r = min(which(!is.na(x))),
            which_min = which.min(is.na(x)),
            which_1 = which(!is.na(x))[1],
            rcpp = which_non_na(x)
        )
    }
)

Output:

   expression vec_length        min     median  `itr/sec` mem_alloc `gc/sec` n_itr  n_gc total_time
   <bch:expr>      <dbl>      <dbl>      <dbl>      <dbl>     <dbl>    <dbl> <int> <dbl>   <bch:tm>
 1 base_r           1000       7.77       9.04       1          Inf      Inf  9999     1    32.37ms
 2 which_min        1000       1.76       2.14       4.75       Inf      NaN 10000     0     6.82ms
 3 which_1          1000       6.69       7.60       1.34       Inf      NaN 10000     0    24.17ms
 4 rcpp             1000       1          1          9.77       NaN      NaN 10000     0     3.31ms
 5 base_r          10000      71.2      121.         1          Inf      Inf  9998     2   451.83ms
 6 which_min       10000      10.3       10.1       11.1        Inf      NaN 10000     0    40.78ms
 7 which_1         10000      63.6       59.0        1.99       Inf      Inf  9998     2   227.42ms
 8 rcpp            10000       1          1        117.         NaN      NaN 10000     0     3.87ms
 9 base_r         100000    1216.      1030.         1.03       Inf      Inf  1288     2   493.42ms
10 which_min      100000     101.        99.5        7.46       Inf      Inf  8903     4   471.91ms
11 which_1        100000    1159.      1002.         1          Inf      Inf  1252     2   494.89ms
12 rcpp           100000       1          1        980.         NaN      NaN 10000     0     4.03ms
13 base_r        1000000   11177.      9625.         1          Inf      Inf   132     3   483.53ms
14 which_min     1000000     962.       970.         9.26       Inf      Inf  1226     6   485.01ms
15 which_1       1000000   10905.      9247.         1.06       Inf      Inf   142     2   489.28ms
16 rcpp          1000000       1          1       8581.         NaN      NaN 10000     0     4.27ms
17 base_r       10000000  108934.     87149.         1.04       Inf      Inf    12     3   408.38ms
18 which_min    10000000   13676.     17399.         5.78       Inf      Inf    75     5   458.28ms
19 which_1      10000000  105301.     95828.         1          Inf      Inf    13     2   459.48ms
20 rcpp         10000000       1          1      84229.         NaN      NaN 10000     0      4.2ms
21 base_r      100000000 1238681.   1224252.         1          Inf      Inf     2     3   749.38ms
22 which_min   100000000  222405.    224765.         5.39       Inf      Inf     8     4   555.68ms
23 which_1     100000000 1067243.   1053431.         1.16       Inf      Inf     2     4   644.82ms
24 rcpp        100000000       1          1    1116208.         NaN      NaN 10000     0     3.36ms
Gusman answered 25/10 at 13:21 Comment(12)
I've been avoiding C++ for years. Looks like I have another reason not to now.Disreputable
Try a microbenchmark for easier comparison. I just did against the methods proposed in the other answers, you are running circles around them.Penny
@AdaLovelace if you find one let me know - sounds useful! I thought it would be faster - firstly because why would you keep checking once you've met the condition, but also it's the same principle as my answer to this which looks backwards from the end of a vector and breaks early, and that was pretty fast.Gusman
I do know (as I once looked at it) that any() in Rcpp Sugar breaks early as you do here --- but I think it won't give us the index. So your quick and simple loop is likely the winner.Penny
Your table is close to impossible to read because bench::mark insists on mixing units. Can you maybe turn the comparison into plot? I know microbenchmark makes that easy.Penny
Good point. 10^8 is larger that (32-bit) integer max aka .Machine$integer.max. On the other hand we only need the first match to happen before that value ==:-)Penny
@AdaLovelace Others might want to use the function in their production code. Best to always support long vectors in Rcpp code when possible.Juliannajulianne
@Juliannajulianne I kinda agree and often use size_t from C/C++ myself.Penny
@AdaLovelace was the which equivalent you looking for this one, written by someone who is (allegedly) no longer here?Key
@AllanCameron Good find! Maybe I was thinking of an RcppArmadillo based function. I really though Rcpp Sugar had one but apparently not.Penny
@AllanCameron I just benchmarked it: it is also slow as it does not 'quit early' on first find.Penny
Sam, and @AdaLovelace please see my answer, in which the same algorithm as fed to Rcpp runs much faster as a plain R code algo.Foret
T
7

You can use which.min and which.max (without is.na()), since as said in the manual:

Missing and NaN values are discarded.

set.seed(1)
x <- c(rep(NA, 3), sample(c(T, F), size = 5e7, replace = T))

f1 <- \() min(which(!is.na(x)))
f2 <- \() which.min(is.na(x))
f3 <- \() min(which.min(x), which.max(x), Inf)
microbenchmark(
  f1 = f1(),
  f2 = f2(),
  f3 = f3(),
  unit = "relative",
  check = "equal"
)

and you will see

Unit: relative
 expr      min       lq      mean    median        uq       max neval
   f1 931928.5 596652.0 12690.309 31796.371 33326.273 232.24601   100
   f2 139656.2  98002.5  2770.152  5879.894  8057.321  84.39669   100
   f3      1.0      1.0     1.000     1.000     1.000   1.00000   100

Remark

From the benchmarking, it seems is.na is an expensive operator from the speed perspective, especially when the size of array scales up

Thrash answered 25/10 at 13:31 Comment(6)
Nice. Could you please add the Rcpp-generated function too?Penny
It's not is.na that is expensive, it is R's promise to check all elements in the vector requiring the loop to visit all N elements. As the actual question only asks for first match one can do much better. Similar helpers exist in other languages too as it is both a common task, and an easy optmization to not visit the remainer of the vector when the answer is already given.Penny
This is a very clever answer but it relies on x being a logical vector which I now regret choosing in my example as I also wanted this to work for numeric data. Using which.min() is almost an order of magnitude speed up though.Disreputable
@Disreputable in that case, you can use !x or x>0 to produce boolean values from numeric values, then apply the same approach as aboveThrash
@Thrash I think you'd lose the speed gain from this nice answer if you did that as R would have to allocate some contiguous memory of the size of the vector to create a temporary logical vector. Did you try the base R Position() or Find() functions? I've never found a reason to use them but this seems like it could be one...Gusman
@Gusman interesting observation! I didn't use Position or Find before, but seems valuable to dive intoThrash
V
6

Very late to the party but didn't see anyone mention Position() that does what you are looking for and stops scanning as soon as a match is found:

Position(\(x) !is.na(x), x) # 4
Vincenzovincible answered 26/10 at 10:13 Comment(0)
T
1

A little faster than using min and base R

which(!is.na(x))[1]

finds the first non-NA value from 100,000,000 values in less than a second, even on a crappy cpu.

Tristan answered 25/10 at 13:37 Comment(7)
I casually benchmarked this against the others and this by far the worst. Note that in your proposal the whole vector needs to be checked whereas we know we only want the first match.Penny
@AdaLovelace I've used microbenchmark on 100Mill values and this ranked before the min approach.Tristan
That claim does not replicate here. I injected your function in the setup shown in the other answer and yours scores about 4.2 milliseconds (and 4.7 for min(which(...))) in the median whereas Rcpp comes up at 5 microseconds. Roughly 1000 times faster. Using which.min() is respectable too at 0.7 milliseconds.Penny
@AdaLovelace That claim does not replicate here. It does since I said "A little faster than using min".Tristan
But you are entirely missing the point of OP who wants a fast breaking function that aborts at the first step. which,min() does that, roughly 5x faster as I mentioned -- and Rcpp does it 1000x. Are you sure you are adding value here? I am not.Penny
@AdaLovelace I am not missing any point here! We search a fast base R solution. And this is already fast. So what am I missing?Tristan
Not to the OP who asked the question. Anyway, we are not making progress here and you just keep insisting on what I consider a non-answer.Penny
F
0

on an Intel Mac, OS 13.5.2

Edit: retest with purely logical input

 x <- as.logical(c(NA,NA,NA,rep(1,times=20), rep(NA,times=5),rep(1,times=biglen)))
Rgames> microbenchmark(which_non_na(x),nonNA(x))
Unit: nanoseconds
            expr   min      lq     mean  median      uq    max neval
 which_non_na(x)   691   735.0  1192.88   898.0  1342.5  16088   100
        nonNA(x) 45624 46285.5 47195.47 46569.5 46928.5 102274   100

Dunno how c++ would fare if it ran on numeric inputs but didn't declare "logical.

nonNA <- function(x) {
    for(jn in 1:length(x)) {
        if(!is.na(x[jn])) return( jn)
    }
    return('not found')
}

biglen=1e6
x <- c(NA,NA,NA,rep(1,times=20), rep(NA,times=5),rep(1,times=biglen))

microbenchmark(which_non_na(x),nonNA(x))
Unit: microseconds
            expr      min       lq       mean    median       uq       max neval
 which_non_na(x) 1432.809 1596.131 2398.95178 1606.1205 1615.342 11521.484   100
        nonNA(x)   46.436   47.125   55.06929   47.8625   56.734   111.771   100

biglen=1e8
 x <- c(NA,NA,NA,rep(1,times=20), rep(NA,times=5),rep(1,times=biglen))
microbenchmark(which_non_na(x),nonNA(x))
Unit: microseconds
            expr        min          lq         mean      median         uq        max neval
 which_non_na(x) 322243.196 327702.4310 349913.34584 353146.7080 370955.373 385839.443   100
        nonNA(x)     46.642     48.3305     86.83075    100.3525    115.982    256.817   100
Foret answered 25/10 at 19:36 Comment(3)
I think what's happening is that x here is not a logical vector so the Rcpp function is type casting which makes a copy. I'd be interested to see whether the results hold if you define x as in the question. Still, this shows that pure R may be a more robust solution if you can't guarantee the types.Gusman
@Gusman Looks like you are correct -- see updated answer with another test caseForet
Interestingly this is what Position() does internally (as demonstrated in @s_baldur's answer).Kincaid

© 2022 - 2024 — McMap. All rights reserved.