match() values with tolerance

Asked 20/10, 2015 at 0:34 Answered 20/10, 2015 at 13:11

I'm subsetting a dataset before plotting, but the key being numeric I cannot use the strict equality testing of match() or %in% (it misses a few values). I wrote the following alternative, but I imagine this problem is sufficiently common that there's a better built-in alternative somewhere? all.equal doesn't seem to be designed for multiple test values.

select_in <- function(x, ref, tol=1e-10){
  testone <- function(value) abs(x - value) < tol
  as.logical(rowSums(sapply(ref, testone)) )
}

x = c(1.0, 1+1e-13, 1.01, 2, 2+1e-9, 2-1e-11)
x %in% c(1,2,3)
#[1]  TRUE FALSE FALSE  TRUE FALSE FALSE
select_in(x, c(1, 2, 3))
#[1]  TRUE  TRUE FALSE  TRUE FALSE  TRUE

Tantara answered 20/10, 2015 at 0:34 Comment(1)

@Frank nope :) please post as an answer – Tantara 20/10, 2015 at 1:38

This seems to achieve the goal (albeit not quite with a tolerance):

fselect_in <- function(x, ref, d = 10){
  round(x, digits=d) %in% round(ref, digits=d)
}

fselect_in(x, c(1,2,3))
# TRUE  TRUE FALSE  TRUE FALSE  TRUE

Mikkel answered 20/10, 2015 at 1:43 Comment(1)

ref being numeric in my case, I had to round both x and ref to the same precision – Tantara 20/10, 2015 at 1:44

Not sure how much better it is but all.equal has a tolerance argument that will work:

`%~%` <- function(x,y) sapply(x, function(.x) {
 any(sapply(y, function(.y) isTRUE(all.equal(.x, .y, tolerance=tol))))
})

x %~% c(1,2,3)
[1]  TRUE  TRUE FALSE  TRUE FALSE  TRUE

I don't like having two apply functions there. I'll try to shorten it.

update

Another way that might be faster without using all.equal. It turns out to be much faster than the first solution:

`%~%` <- function(x,y) {
out <- logical(length(x))
for(i in 1:length(x)) out[i] <- any(abs(x[i] - y) <= tol)
out
}

x %~% c(1,2,3)
[1]  TRUE  TRUE FALSE  TRUE FALSE  TRUE

Benchmark

big.x <- rep(x, 1e3)
big.y <- rep(y, 100)

all.equal(select_in(big.x, big.y), big.x %~% big.y)
[1] TRUE

library(microbenchmark)
microbenchmark(
  baptiste = select_in(big.x, big.y),
  plafort2 = big.x %~% big.y,
  times=50L)
Unit: milliseconds
     expr       min        lq      mean    median       uq      max
 baptiste 185.86828 199.57517 231.28246 244.81980 261.7451 271.3426
 plafort2  49.03265  54.30729  84.88076  66.10971 118.3270 123.1074
 neval cld
    50   b
    50  a

Addlepated answered 20/10, 2015 at 0:53 Comment(9)

I wonder in what the second solution is different from OP's one. – Reference 20/10, 2015 at 1:6

It's close, but I think it's different enough to possibly add value. – Addlepated 20/10, 2015 at 1:8

you're looping over x, while I was looping over ref, so it's different. In my particular case length(ref) << length(x), so if a loop must be used it's probably better to do it my way. – Tantara 20/10, 2015 at 1:9

fair enough, I guess the function call is what makes the difference. Thanks. – Tantara 20/10, 2015 at 1:45

Frank's answer is great. What function call? – Addlepated 20/10, 2015 at 1:47

You have tol as a global var read inside these functions? I guess there's no other way, since you're making a binary operator..? – Mikkel 20/10, 2015 at 1:49

Yes, for the binary operators. Similar performance as argument also. – Addlepated 20/10, 2015 at 1:50

It's actually a bit faster with the arg defined. – Addlepated 20/10, 2015 at 1:51

Ok, interesting. I guess I'd consider doing options( eqtol = 1e-10 ) and grabbing getOption("eqtol") inside the function so as to be less likely to accidentally modify it. (I have no experience actually doing such a thing, though.) – Mikkel 20/10, 2015 at 1:54

Another idea to avoid length(x) * length(ref) searching:

ff = function(x, ref, tol = 1e-10)
{
    sref = sort(ref)
    i = findInterval(x, sref, all.inside = TRUE)
    dif1 = abs(x - sref[i])
    dif2 = abs(x - sref[i + 1])
    dif = dif1 > dif2
    dif1[dif] = dif2[dif] 
    dif1 <= tol
}
ff(c(1.0, 1+1e-13, 1.01, 2, 2+1e-9, 2-1e-11), c(1, 2, 3))
#[1]  TRUE  TRUE FALSE  TRUE FALSE  TRUE

And to compare:

set.seed(911)
X = sample(1e2, 5e5, TRUE) + (sample(c(1e-8, 1e-9, 1e-10, 1e-12, 1e-13), 5e5, TRUE) * sample(c(-1, 1), 5e5, TRUE))
REF = as.double(1:1e2)

all.equal(ff(X, REF), select_in(X, REF))
#[1] TRUE
tol = 1e-10 #set this for Pierre's function
microbenchmark::microbenchmark(select_in(X, REF), fselect_in(X, REF), X %~% REF, ff(X, REF), { round(X, 10); round(REF, 10) }, times = 35)
#Unit: milliseconds
#                                    expr        min         lq     median         uq        max neval
#                       select_in(X, REF) 1259.95876 1324.52371 1380.10492 1428.78677 1495.61810    35
#                      fselect_in(X, REF)  121.47241  123.72678  125.28932  128.56770  142.15676    35
#                               X %~% REF 2023.78159 2088.97226 2161.66973 2219.46164 2547.89849    35
#                              ff(X, REF)   67.35003   69.39804   71.20871   73.22626   94.04477    35
# {     round(X, 10)     round(REF, 10) }   96.20344   96.88344   99.10093  102.66328  117.75189    35

Frank's match should be faster than findInterval, and indeed is, with most time spent in round.

Basketry answered 20/10, 2015 at 13:11 Comment(0)

Recommended topics

Hot tags