How can I efficiently find the index of a value in a sorted array?
Asked Answered
T

3

10

I have a sorted array of values and a single value like so:

x <- c(1.0, 3.45, 5.23, 7.3, 12.5, 23.45)
v <- 6.45

I can find the index of the value after which v would be inserted into x while maintaining the sorting order:

max(which(x <= v))
[1] 3

It is nice and compact code, but I have the gut feeling that behind-the-scenes this is really inefficient: since which() does not know that the array is sorted it has to inspect all values.

Is there a better way of finding this index value?

Note: I am not interested in actually merging v into x. I just want the index value.

Troth answered 6/12, 2021 at 4:21 Comment(4)
Check out the ?findInterval function.Convoluted
@Convoluted Just what I was looking for! Since I didn't find it (and I tried, in multiple ways), perhaps others may benefit from this little gem too. Make this an answer and you'll have my vote.Troth
@tjebo If you run findInterval(v, x) with this sample data it returns 3 as expected. It doesn't return a vector of 0/1s. I'm not sure why your version is doing that.Convoluted
Related (albeit not explicilty about efficiency): Position in vector based on approximate matching (including findInterval)Loquitur
D
7

You can use findInterval which makes use of a binary search.

findInterval(v, x)
#[1] 3

Or using C++ upper_bound with Rcpp.

Rcpp::cppFunction(
  "int upper_bound(double v, const Rcpp::NumericVector& x) {
     return std::distance(x.begin(), std::upper_bound(x.begin(), x.end(), v));
}")

upper_bound(v, x)
#[1] 3

Or in case you have also a vector of positions like in findInterval.

Rcpp::cppFunction("
Rcpp::IntegerVector upper_bound2(const Rcpp::NumericVector& v
                               , const Rcpp::NumericVector& x) {
  Rcpp::IntegerVector res(v.length());
  for(int i=0; i < res.length(); ++i) {
    res[i] = std::distance(x.begin(), std::upper_bound(x.begin(), x.end(), v[i]));
  }
  return res;
}")

v <- c(3, 6.45)
upper_bound2(v, x)
#[1] 1 3
findInterval(v, x)
#[1] 1 3
Devolve answered 6/12, 2021 at 9:5 Comment(0)
D
8

Benchmark based on Егор-Шишунов's answer:

# Functions:
Rcpp::cppFunction(
  "int Erop(double x, const Rcpp::NumericVector& v)
  {
    int min = 0;
    int max = v.size();
    while (max - min > 1)
    {
      int idx = (min + max) / 2;
      if (v[idx] > x)
      {
        max = idx;
      }
      else
      {
        min = idx;
      }
    }
    return min + 1;
  }"
)

Rcpp::cppFunction(
  "int GKi(double v, const Rcpp::NumericVector& x) {
     return std::distance(x.begin(), std::upper_bound(x.begin(), x.end(), v));
}")

Rcpp::cppFunction("
Rcpp::IntegerVector GKi2(const Rcpp::NumericVector& v 
                       , const Rcpp::NumericVector& x) {
  Rcpp::IntegerVector res(v.length());
  for(int i=0; i < res.length(); ++i) {
    res[i] = std::distance(x.begin(), std::upper_bound(x.begin(), x.end(), v[i]));
  }
  return res;
}")
# Data:
set.seed(42)
x <- sort(rnorm(1e6))
v <- sort(c(sample(x, 15), rnorm(15)))
# Result:
bench::mark(whichMax= sapply(v, \(v) max(which(x <= v)))
          , sum = sapply(v, \(v) sum(x<=v))
          , findInterval = findInterval(v, x)
          , Erop = sapply(v, \(v) Erop(v, x))
          , GKi = sapply(v, \(v) GKi(v, x))
          , GKi2 = GKi2(v, x)
)
#  expression        min   median `itr/sec` mem_alloc `gc/sec` n_itr  n_gc
#  <bch:expr>   <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl> <int> <dbl>
#1 whichMax      92.03ms 102.32ms      9.15        NA    102.      5    56
#2 sum           74.91ms  77.84ms     12.0         NA     37.9     6    19
#3 findInterval 680.41µs 755.61µs   1263.          NA      0     632     0
#4 Erop          57.19µs  62.13µs  12868.          NA     24.0  6432    12
#5 GKi           54.53µs   60.4µs  13316.          NA     24.0  6657    12
#6 GKi2           2.02µs   2.38µs 386027.          NA      0   10000     0
Devolve answered 6/12, 2021 at 4:21 Comment(2)
What coding standard prescribes leading commas?Unconformable
Maybe I have it from SQL: The next value on the list is dependant on the comma, and so, they should be kept together. Also Emacs (ESS) supports it with correct indention.Devolve
F
7

If you need a faster version and you don't need to check your inputs you can write an easy C++ function:

Rcpp::cppFunction(
  "int foo(double x, const Rcpp::NumericVector& v)
  {
    int min = 0;
    int max = v.size();
    while (max - min > 1)
    {
      int idx = (min + max) / 2;
      if (v[idx] > x)
      {
        max = idx;
      }
      else
      {
        min = idx;
      }
    }
    return min + 1;
  }"
)

If you need it, you can check if (x < v[0]) by yourself (I don't know what you want to see in this case). And you can test it by using package microbenchmark:

library(microbenchmark)

n = 1e6
v = sort(rnorm(n, 0, 15))
x = runif(1, -15, 15)
microbenchmark(max(which(v <= x)), sum(v <= x), findInterval(x, v), foo(x, v))

Result:

Enter image description here

Faruq answered 6/12, 2021 at 7:32 Comment(2)
nice. would you care adding the sum(…) solution as per henriks linked thread? would be nice to see how it compares to the other solutionsWauters
GKi made a more qualified test, I think he can add sum(v <= x) (If I understand what you want). But I update my test.Pyrophosphate
D
7

You can use findInterval which makes use of a binary search.

findInterval(v, x)
#[1] 3

Or using C++ upper_bound with Rcpp.

Rcpp::cppFunction(
  "int upper_bound(double v, const Rcpp::NumericVector& x) {
     return std::distance(x.begin(), std::upper_bound(x.begin(), x.end(), v));
}")

upper_bound(v, x)
#[1] 3

Or in case you have also a vector of positions like in findInterval.

Rcpp::cppFunction("
Rcpp::IntegerVector upper_bound2(const Rcpp::NumericVector& v
                               , const Rcpp::NumericVector& x) {
  Rcpp::IntegerVector res(v.length());
  for(int i=0; i < res.length(); ++i) {
    res[i] = std::distance(x.begin(), std::upper_bound(x.begin(), x.end(), v[i]));
  }
  return res;
}")

v <- c(3, 6.45)
upper_bound2(v, x)
#[1] 1 3
findInterval(v, x)
#[1] 1 3
Devolve answered 6/12, 2021 at 9:5 Comment(0)

© 2022 - 2024 — McMap. All rights reserved.