Find closest value in a vector with binary search
Asked Answered
V

8

52

As a silly toy example, suppose

x=4.5
w=c(1,2,4,6,7)

I wonder if there is a simple R function that finds the index of the closest match to x in w. So if foo is that function, foo(w,x) would return 3. The function match is the right idea, but seems to apply only for exact matches.

Solutions here (e.g. which.min(abs(w - x)), which(abs(w-x)==min(abs(w-x))), etc.) are all O(n) instead of log(n) (I'm assuming that w is already sorted).

Vogele answered 21/11, 2013 at 22:33 Comment(1)
fuzzyjoin could be helpful in setting explicit criteria for and finding inexact matches according to the match that has the best scoreGora
K
45

You can use data.table to do a binary search:

dt = data.table(w, val = w) # you'll see why val is needed in a sec
setattr(dt, "sorted", "w")  # let data.table know that w is sorted

Note that if the column w isn't already sorted, then you'll have to use setkey(dt, w) instead of setattr(.).

# binary search and "roll" to the nearest neighbour
dt[J(x), roll = "nearest"]
#     w val
#1: 4.5   4

In the final expression the val column will have the you're looking for.

# or to get the index as Josh points out
# (and then you don't need the val column):
dt[J(x), .I, roll = "nearest", by = .EACHI]
#     w .I
#1: 4.5  3

# or to get the index alone
dt[J(x), roll = "nearest", which = TRUE]
#[1] 3
Kalle answered 21/11, 2013 at 22:48 Comment(17)
This must be 99% of the way to the answer. In the end, I want 3, the index of 4 in w.Vogele
I had a similar thought, but given that the OP wants the vector's index, would have done: dt = data.table(w, key="w"); dt[J(x), .I,roll = "nearest"][[2]]Kurman
@JoshO'Brien fair enough, I didn't read OP too carefully :), but don't use the key argument - that will force a resortKalle
@eddi, I don't think it's specified that the vector is always sorted. And even if it is, I think setkey checks for is.sorted(.) before sorting.Frerichs
@Frerichs -- OP does mention that the vector can be assumed sorted, but I was thinking along the lines of your second sentence. Any way to mark the data.table as already sorted by a column without actually resorting it?Kurman
@Frerichs is.sorted is O(n) when the vector is sorted and so takes a very long time.Kalle
@eddi: x <- 1:1e7; system.time(is.unsorted(x)) (0.057 seconds).Frerichs
@JoshO'Brien, I did miss the last line. Thanks. It's the way eddi has done it. setkey looks for attribute "sorted". But, even if it's sorted, it still checks to make sure it is sorted, iirc (as in some cases, you may have noticed the warning "the key was set, but not properly.. so setting again.."Frerichs
@Frerichs -- So doing attributes(dt) <- c(attributes(dt), sorted="w") is not only hacky but ineffective! Sounds like good software design on the part of the data.table team.Kurman
@Frerichs exactly ;) when you live in a world where you want to use a binary search for this problem, it's likely you also live in a world where that's 57 milliseconds more than 0.Kalle
@Frerichs what about it? it is O(n) - try increasing/decreasing size of x - to check if something is sorted you have to go through the entire thingKalle
@Frerichs btw I just timed setkey and it's much slower than is.unsorted - haven't looked at the code yet but it does seem like it's sorting a sorted vectorKalle
Hmm, it does seem like is.unsorted is not called unless the "attribute" is set. I wonder why.. I think there's speedup possible here. Will check.Frerichs
One issue maybe with NA values though.Frerichs
What is J in J(x) in dt[J(x), roll = "nearest"]?Faxon
@ConnerM. it's a shortcut for data.table. Nowadays you can also use . instead of J.Kalle
FWIW, if there are repeat values in the w vector, the above solution will give the index of the rightmost solution. To get the leftmost solution, add in mult='first', so the full line would be: dt[J(x), roll = "nearest", which = TRUE, mult='first']Klausenburg
H
46
R>findInterval(4.5, c(1,2,4,5,6))
[1] 3

will do that with price-is-right matching (closest without going over).

Hypaethral answered 10/4, 2015 at 3:38 Comment(4)
findInterval {base} Find Interval Numbers or Indices stat.ethz.ch/R-manual/R-devel/library/base/html/…Autobahn
To get the nearest element using this approach you can do search in intervals from mid points between adjacent target points: w[findInterval(x, (w[-length(w)] + w[-1]) / 2) + 1]Transcurrent
@Transcurrent That should work, but its O(n) instead of O(log n) because of the midpoint calculation.Hypaethral
@NealFultz, you're right. For performance simple "if" check for the distance to the next point is enough. if (res == 0 || (res != length(w) && w[res + 1] - x < x - w[res])) res <- res + 1Transcurrent
K
45

You can use data.table to do a binary search:

dt = data.table(w, val = w) # you'll see why val is needed in a sec
setattr(dt, "sorted", "w")  # let data.table know that w is sorted

Note that if the column w isn't already sorted, then you'll have to use setkey(dt, w) instead of setattr(.).

# binary search and "roll" to the nearest neighbour
dt[J(x), roll = "nearest"]
#     w val
#1: 4.5   4

In the final expression the val column will have the you're looking for.

# or to get the index as Josh points out
# (and then you don't need the val column):
dt[J(x), .I, roll = "nearest", by = .EACHI]
#     w .I
#1: 4.5  3

# or to get the index alone
dt[J(x), roll = "nearest", which = TRUE]
#[1] 3
Kalle answered 21/11, 2013 at 22:48 Comment(17)
This must be 99% of the way to the answer. In the end, I want 3, the index of 4 in w.Vogele
I had a similar thought, but given that the OP wants the vector's index, would have done: dt = data.table(w, key="w"); dt[J(x), .I,roll = "nearest"][[2]]Kurman
@JoshO'Brien fair enough, I didn't read OP too carefully :), but don't use the key argument - that will force a resortKalle
@eddi, I don't think it's specified that the vector is always sorted. And even if it is, I think setkey checks for is.sorted(.) before sorting.Frerichs
@Frerichs -- OP does mention that the vector can be assumed sorted, but I was thinking along the lines of your second sentence. Any way to mark the data.table as already sorted by a column without actually resorting it?Kurman
@Frerichs is.sorted is O(n) when the vector is sorted and so takes a very long time.Kalle
@eddi: x <- 1:1e7; system.time(is.unsorted(x)) (0.057 seconds).Frerichs
@JoshO'Brien, I did miss the last line. Thanks. It's the way eddi has done it. setkey looks for attribute "sorted". But, even if it's sorted, it still checks to make sure it is sorted, iirc (as in some cases, you may have noticed the warning "the key was set, but not properly.. so setting again.."Frerichs
@Frerichs -- So doing attributes(dt) <- c(attributes(dt), sorted="w") is not only hacky but ineffective! Sounds like good software design on the part of the data.table team.Kurman
@Frerichs exactly ;) when you live in a world where you want to use a binary search for this problem, it's likely you also live in a world where that's 57 milliseconds more than 0.Kalle
@Frerichs what about it? it is O(n) - try increasing/decreasing size of x - to check if something is sorted you have to go through the entire thingKalle
@Frerichs btw I just timed setkey and it's much slower than is.unsorted - haven't looked at the code yet but it does seem like it's sorting a sorted vectorKalle
Hmm, it does seem like is.unsorted is not called unless the "attribute" is set. I wonder why.. I think there's speedup possible here. Will check.Frerichs
One issue maybe with NA values though.Frerichs
What is J in J(x) in dt[J(x), roll = "nearest"]?Faxon
@ConnerM. it's a shortcut for data.table. Nowadays you can also use . instead of J.Kalle
FWIW, if there are repeat values in the w vector, the above solution will give the index of the rightmost solution. To get the leftmost solution, add in mult='first', so the full line would be: dt[J(x), roll = "nearest", which = TRUE, mult='first']Klausenburg
F
8

See match.closest() from the MALDIquant package:

> library(MALDIquant)
> match.closest(x, w)
[1] 3
Fernandofernas answered 2/10, 2018 at 18:52 Comment(0)
I
4
NearestValueSearch = function(x, w){
  ## A simple binary search algo
  ## Assume the w vector is sorted so we can use binary search
  left = 1
  right = length(w)
  while(right - left > 1){
    middle = floor((left + right) / 2)
    if(x < w[middle]){
      right = middle
    }
    else{
      left = middle
    }
  }
  if(abs(x - w[right]) < abs(x - w[left])){
    return(right)
  }
  else{
    return(left)
  }
}


x = 4.5
w = c(1,2,4,6,7)
NearestValueSearch(x, w) # return 3
Indescribable answered 26/6, 2017 at 15:18 Comment(2)
This algorithm is extremely fast compared to the other proposals: w = cumsum(abs(rnorm(100000))) microbenchmark::microbenchmark(NearestValueSearch(777,w)) # 12 µs microbenchmark::microbenchmark(which.min(abs(w-777))) # 300 µs microbenchmark::microbenchmark(findInterval(777,w)) # 140 µs Pincushion
@egus it is however much slower than findInterval if you are searching for many values (NearsestValueSearch would require wrapping in sapply) and in contrast to the other solutions will silently fail, when the assumptions are not met.Berryman
H
3
x = 4.5
w = c(1,2,4,6,7)

closestLoc = which(min(abs(w-x)))
closestVal = w[which(min(abs(w-x)))]

# On my phone- please pardon typos

If your vector is lengthy, try a 2-step approach:

x = 4.5
w = c(1,2,4,6,7)

sdev = sapply(w,function(v,x) abs(v-x), x = x)
closestLoc = which(min(sdev))

for maddeningly long vectors (millions of rows!, warning- this will actually be slower for data which is not very, very, very large.)

require(doMC)
registerDoMC()

closestLoc = which(min(foreach(i = w) %dopar% {
   abs(i-x)
}))

This example is just to give you a basic idea of leveraging parallel processing when you have huge data. Note, I do not recommend you use it for simple & fast functions like abs().

Hexastyle answered 21/11, 2013 at 22:41 Comment(1)
Just saw. data.table is the way to go!Hexastyle
H
3

To do this on character vectors, Martin Morgan suggested this function on R-help:

bsearch7 <-
     function(val, tab, L=1L, H=length(tab))
{
     b <- cbind(L=rep(L, length(val)), H=rep(H, length(val)))
     i0 <- seq_along(val)
     repeat {
         updt <- M <- b[i0,"L"] + (b[i0,"H"] - b[i0,"L"]) %/% 2L
         tabM <- tab[M]
         val0 <- val[i0]
         i <- tabM < val0
         updt[i] <- M[i] + 1L
         i <- tabM > val0
         updt[i] <- M[i] - 1L
         b[i0 + i * length(val)] <- updt
         i0 <- which(b[i0, "H"] >= b[i0, "L"])
         if (!length(i0)) break;
     }
     b[,"L"] - 1L
} 
Hypaethral answered 11/4, 2015 at 7:1 Comment(0)
A
1

Based on @neal-fultz answer, here is a simple function that uses findInterval():

get_closest_index <- function(x, vec){
  # vec must be sorted
  iv <- findInterval(x, vec)
  dist_left <- x - vec[ifelse(iv == 0, NA, iv)]
  dist_right <- vec[iv + 1] - x
  ifelse(! is.na(dist_left) & (is.na(dist_right) | dist_left < dist_right), iv, iv + 1)
}
values <- c(-15, -0.01, 3.1, 6, 10, 100)
grid <- c(-2, -0.1, 0.1, 3, 7)
get_closest_index(values, grid)
#> [1] 1 2 4 5 5 5

Created on 2020-05-29 by the reprex package (v0.3.0)

Aggress answered 29/5, 2020 at 12:56 Comment(0)
X
-2

You can always implement custom binary search algorithm to find the closest value. Alternately, you can leverage standard implementation of libc bsearch(). You can use other binary search implementations as well, but it does not change the fact that you have to implement the comparing function carefully to find the closest element in array. The issue with standard binary search implementation is that it is meant for exact comparison. That means your improvised comparing function needs to do some kind of exactification to figure out if an element in array is close-enough. To achieve it, the comparing function needs to have awareness of other elements in the array, especially following aspects:

  • position of the current element (one which is being compared with the key).
  • the distance with key and how it compares with neighbors (previous or next element).

To provide this extra knowledge in comparing function, the key needs to be packaged with additional information (not just the key value). Once the comparing function have awareness on these aspects, it can figure out if the element itself is closest. When it knows that it is the closest, it returns "match".

The the following C code finds the closest value.

#include <stdio.h>
#include <stdlib.h>

struct key {
        int key_val;
        int *array_head;
        int array_size;
};

int compar(const void *k, const void *e) {
        struct key *key = (struct key*)k;
        int *elem = (int*)e;
        int *arr_first = key->array_head;
        int *arr_last = key->array_head + key->array_size -1;
        int kv = key->key_val;
        int dist_left;
        int dist_right;

        if (kv == *elem) {
                /* easy case: if both same, got to be closest */
                return 0;
        } else if (key->array_size == 1) {
                /* easy case: only element got to be closest */
                return 0;
        } else if (elem == arr_first) {
                /* element is the first in array */
                if (kv < *elem) {
                        /* if keyval is less the first element then
                         * first elem is closest.
                         */
                        return 0;
                } else {
                        /* check distance between first and 2nd elem.
                         * if distance with first elem is smaller, it is closest.
                         */
                        dist_left = kv - *elem;
                        dist_right = *(elem+1) - kv;
                        return (dist_left <= dist_right) ? 0:1;
                }
        } else if (elem == arr_last) {
                /* element is the last in array */
                if (kv > *elem) {
                        /* if keyval is larger than the last element then
                         * last elem is closest.
                         */
                        return 0;
                } else {
                        /* check distance between last and last-but-one.
                         * if distance with last elem is smaller, it is closest.
                         */
                        dist_left = kv - *(elem-1);
                        dist_right = *elem - kv;
                        return (dist_right <= dist_left) ? 0:-1;
                }
        }

        /* condition for remaining cases (other cases are handled already):
         * - elem is neither first or last in the array
         * - array has atleast three elements.
         */

        if (kv < *elem) {
                /* keyval is smaller than elem */

                if (kv <= *(elem -1)) {
                        /* keyval is smaller than previous (of "elem") too.
                         * hence, elem cannot be closest.
                         */
                        return -1;
                } else {
                        /* check distance between elem and elem-prev.
                         * if distance with elem is smaller, it is closest.
                         */
                        dist_left = kv - *(elem -1);
                        dist_right = *elem - kv;
                        return (dist_right <= dist_left) ? 0:-1;
                }
        }

        /* remaining case: (keyval > *elem) */

        if (kv >= *(elem+1)) {
                /* keyval is larger than next (of "elem") too.
                 * hence, elem cannot be closest.
                 */
                return 1;
        }

        /* check distance between elem and elem-next.
         * if distance with elem is smaller, it is closest.
         */
        dist_right = *(elem+1) - kv;
        dist_left = kv - *elem;
        return (dist_left <= dist_right) ? 0:1;
}


int main(int argc, char **argv) {
        int arr[] = {10, 20, 30, 40, 50, 60, 70};
        int *found;
        struct key k;

        if (argc < 2) {
                return 1;
        }

        k.key_val = atoi(argv[1]);
        k.array_head = arr;
        k.array_size = sizeof(arr)/sizeof(int);

        found = (int*)bsearch(&k, arr, sizeof(arr)/sizeof(int), sizeof(int),
                compar);

        if(found) {
                printf("found closest: %d\n", *found);
        } else {
                printf("closest not found. absurd! \n");
        }

        return 0;
}

Needless to say that bsearch() in above example should never fail (unless the array size is zero).

If you implement your own custom binary search, essentially you have to embed same comparing logic in the main body of binary search code (instead of having this logic in comparing function in above example).

Xylo answered 1/5, 2017 at 18:25 Comment(2)
In addition, there are other tricks can be played as well. One of the property of binary search (or other searches as well) is that when search fails, the last comparison fails within the "proximity" of the key. For example, if the key is larger than the last element of the array, the last failed comparison is going to be with the last element. If key within the the range of array, the last two comparisons are going be with the elements located at the left and right sides of the key. If comparison function save distances, you can figure out the closest value even when bsearch() fails.Xylo
That is a C function, not an R function.Hypaethral

© 2022 - 2024 — McMap. All rights reserved.