Find closest value in a vector with binary search

As a silly toy example, suppose

x=4.5
w=c(1,2,4,6,7)

I wonder if there is a simple R function that finds the index of the closest match to x in w. So if foo is that function, foo(w,x) would return 3. The function match is the right idea, but seems to apply only for exact matches.

Solutions here (e.g. which.min(abs(w - x)), which(abs(w-x)==min(abs(w-x))), etc.) are all O(n) instead of O(log n) (I'm assuming that w is already sorted).
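For reference, here is the linear-scan baseline on the toy data, as a minimal sketch. It computes the distance from x to every element of w, which is where the O(n) cost comes from. The name idx is only illustrative.

```r
x <- 4.5
w <- c(1, 2, 4, 6, 7)

# abs(w - x) touches every element of w, so this is linear in length(w)
idx <- which.min(abs(w - x))
idx     # 3
w[idx]  # 4
```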

Vogele answered 21/11, 2013 at 22:33 Comment(1)

fuzzyjoin could be helpful in setting explicit criteria for and finding inexact matches according to the match that has the best score – Gora 23/7, 2021 at 7:36

You can use data.table to do a binary search:

dt = data.table(w, val = w) # you'll see why val is needed in a sec
setattr(dt, "sorted", "w")  # let data.table know that w is sorted

Note that if the column w isn't already sorted, then you'll have to use setkey(dt, w) instead of setattr(.).

# binary search and "roll" to the nearest neighbour
dt[J(x), roll = "nearest"]
#     w val
#1: 4.5   4

In the final expression, the val column will have the value you're looking for.

# or to get the index as Josh points out
# (and then you don't need the val column):
dt[J(x), .I, roll = "nearest", by = .EACHI]
#     w .I
#1: 4.5  3

# or to get the index alone
dt[J(x), roll = "nearest", which = TRUE]
#[1] 3

Kalle answered 21/11, 2013 at 22:48 Comment(17)

This must be 99% of the way to the answer. In the end, I want 3, the index of 4 in w. – Vogele 21/11, 2013 at 22:51

I had a similar thought, but given that the OP wants the vector's index, would have done: dt = data.table(w, key="w"); dt[J(x), .I,roll = "nearest"][[2]] – Kurman 21/11, 2013 at 22:52

@JoshO'Brien fair enough, I didn't read OP too carefully :), but don't use the key argument - that will force a resort – Kalle 21/11, 2013 at 22:53

@eddi, I don't think it's specified that the vector is always sorted. And even if it is, I think setkey checks for is.sorted(.) before sorting. – Frerichs 21/11, 2013 at 22:54

@Frerichs -- OP does mention that the vector can be assumed sorted, but I was thinking along the lines of your second sentence. Any way to mark the data.table as already sorted by a column without actually resorting it? – Kurman 21/11, 2013 at 22:56

@Frerichs is.sorted is O(n) when the vector is sorted and so takes a very long time. – Kalle 21/11, 2013 at 22:56

@eddi: x <- 1:1e7; system.time(is.unsorted(x)) (0.057 seconds). – Frerichs 21/11, 2013 at 23:0

@JoshO'Brien, I did miss the last line. Thanks. It's the way eddi has done it. setkey looks for attribute "sorted". But, even if it's sorted, it still checks to make sure it is sorted, iirc (as in some cases, you may have noticed the warning "the key was set, but not properly.. so setting again.."). – Frerichs 21/11, 2013 at 23:1

@Frerichs -- So doing attributes(dt) <- c(attributes(dt), sorted="w") is not only hacky but ineffective! Sounds like good software design on the part of the data.table team. – Kurman 21/11, 2013 at 23:5

@Frerichs exactly ;) when you live in a world where you want to use a binary search for this problem, it's likely you also live in a world where that's 57 milliseconds more than 0. – Kalle 21/11, 2013 at 23:7

@Frerichs what about it? it is O(n) - try increasing/decreasing size of x - to check if something is sorted you have to go through the entire thing – Kalle 21/11, 2013 at 23:12

@Frerichs btw I just timed setkey and it's much slower than is.unsorted - haven't looked at the code yet but it does seem like it's sorting a sorted vector – Kalle 21/11, 2013 at 23:17

Hmm, it does seem like is.unsorted is not called unless the "attribute" is set. I wonder why.. I think there's speedup possible here. Will check. – Frerichs 21/11, 2013 at 23:21

One issue maybe with NA values though. – Frerichs 21/11, 2013 at 23:24

What is J in J(x) in dt[J(x), roll = "nearest"]? – Faxon 6/10, 2017 at 20:51

@ConnerM. it's a shortcut for data.table. Nowadays you can also use . instead of J. – Kalle 6/10, 2017 at 21:51

FWIW, if there are repeat values in the w vector, the above solution will give the index of the rightmost solution. To get the leftmost solution, add in mult='first', so the full line would be: dt[J(x), roll = "nearest", which = TRUE, mult='first'] – Klausenburg 10/4, 2019 at 17:22

R>findInterval(4.5, c(1,2,4,5,6))
[1] 3

will do that with price-is-right matching (closest without going over).
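A small sketch of what "closest without going over" means in practice, using the w from the question rather than the vector above. findInterval returns the index of the largest element that is less than or equal to the query, so it is not always the nearest element, and it returns 0 when the query is below every element.

```r
w <- c(1, 2, 4, 6, 7)  # sorted, as in the question

findInterval(4.5, w)   # 3: w[3] = 4 is the largest element <= 4.5
findInterval(5.5, w)   # 3 as well, although 6 (index 4) is nearer
findInterval(0, w)     # 0: the query is below every element
findInterval(10, w)    # 5: the query is at or above the last element
```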

Hypaethral answered 10/4, 2015 at 3:38 Comment(4)

findInterval {base} Find Interval Numbers or Indices stat.ethz.ch/R-manual/R-devel/library/base/html/… – Autobahn 12/8, 2015 at 17:13

To get the nearest element using this approach, you can search the intervals whose breakpoints are the midpoints between adjacent target points: w[findInterval(x, (w[-length(w)] + w[-1]) / 2) + 1] – Transcurrent 1/12, 2016 at 7:24

@Transcurrent That should work, but it's O(n) instead of O(log n) because of the midpoint calculation. – Hypaethral 22/12, 2016 at 8:12

@NealFultz, you're right. For performance, a simple "if" check on the distance to the next point is enough: if (res == 0 || (res != length(w) && w[res + 1] - x < x - w[res])) res <- res + 1 – Transcurrent 22/12, 2016 at 9:13
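Putting the last comment together with findInterval, a minimal sketch might look like the following. The function name nearest_index is hypothetical, and the sketch assumes w is non-empty, sorted in increasing order and free of NAs, and that x is a single number. On an exact tie it keeps the lower index. As far as I know, findInterval by default also verifies that w is sorted, so the call as a whole may still do a linear pass even though the search step itself is logarithmic.

```r
# Hypothetical helper: index of the element of sorted `w` nearest to scalar `x`.
# Assumes `w` is non-empty, increasing, and contains no NAs.
nearest_index <- function(w, x) {
  # Largest i with w[i] <= x, or 0 if x is below w[1]
  res <- findInterval(x, w)
  # Step right if there is no left neighbour, or if the right neighbour
  # is strictly closer; `||` short-circuits, so w[0] is never compared
  if (res == 0 || (res != length(w) && w[res + 1] - x < x - w[res])) {
    res <- res + 1
  }
  res
}

w <- c(1, 2, 4, 6, 7)
nearest_index(w, 4.5)  # 3
nearest_index(w, 5.5)  # 4
nearest_index(w, 0)    # 1
nearest_index(w, 10)   # 5
```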
