findInterval() with right-closed intervals
Asked Answered
S

3

20

The great findInterval() function in R uses left-closed sub-intervals in its vec argument, as shown in its docs:

if i <- findInterval(x,v), we have v[i[j]] <= x[j] < v[i[j] + 1]

If I want right-closed sub-intervals, what are my options? The best I've come up with is this:

findInterval.rightClosed <- function(x, vec, ...) {
  fi <- findInterval(x, vec, ...)
  fi - (x==vec[fi])
}

Another one also works:

findInterval.rightClosed2 <- function(x, vec, ...) {
  length(vec) - findInterval(-x, -rev(vec), ...)
}

Here's a little test:

x <- c(3, 6, 7, 7, 29, 37, 52)
vec <- c(2, 5, 6, 35)
findInterval(x, vec)
# [1] 1 3 3 3 3 4 4
findInterval.rightClosed(x, vec)
# [1] 1 2 3 3 3 4 4
findInterval.rightClosed2(x, vec)
# [1] 1 2 3 3 3 4 4

But I'd like to see any other solutions if there's a better one. By "better", I mean "somehow more satisfying" or "doesn't feel like a kludge" or maybe even "more efficient". =)

(Note that there's a rightmost.closed argument to findInterval(), but it's different - it only refers to the final sub-interval and has a different meaning.)

Sleeve answered 20/11, 2012 at 21:54 Comment(6)
What do you think about: findInterval(x, c(-Inf, head(vec, -1)))?Indefeasible
@Indefeasible that doesn't seem to do the trick, I added an example and yours doesn't give the same result.Sleeve
I'm a bit befuddled here, but does findInterval(x-1,vec) do what you are looking for?Hemiterpene
@thelatemail, as long as it's just integers?Benner
I think it is ok to post answers to your own question. My vote would go to your findInterval.rightClosed2.Abstemious
Another desirable property would be that it works with types other than numeric (e.g. POSIXt objects, but ideally anything with comparison operators), as findInterval does (findInterval works with anything that supports an as.double method). The first function above succeeds, but the second fails, because there's no way to negate POSIXt objects.Sleeve
B
10

EDIT: Major clean-up in all aisles.

You might look at cut. By default, cut makes left open and right closed intervals, and that can be changed using the appropriate argument (right). To use your example:

x <- c(3, 6, 7, 7, 29, 37, 52)
vec <- c(2, 5, 6, 35)
cutVec <- c(vec, max(x)) # for cut, range of vec should cover all of x

Now create four functions that should do the same thing: Two from the OP, one from Josh O'Brien, and then cut. Two arguments to cut have been changed from default settings: include.lowest = TRUE will create an interval closed on both sides for the smallest (leftmost) interval. labels = FALSE will cause cut to return simply the integer values for the bins instead of creating a factor, which it otherwise does.

findInterval.rightClosed <- function(x, vec, ...) {
  fi <- findInterval(x, vec, ...)
  fi - (x==vec[fi])
}
findInterval.rightClosed2 <- function(x, vec, ...) {
  length(vec) - findInterval(-x, -rev(vec), ...)
}
cutFun <- function(x, vec){
    cut(x, vec, include.lowest = TRUE, labels = FALSE)
}
# The body of fiFun is a contribution by Josh O'Brien that got fed to the ether.
fiFun <- function(x, vec){
    xxFI <- findInterval(x, vec * (1 + .Machine$double.eps))
}

Do all functions return the same result? Yup. (notice the use of cutVec for cutFun)

mapply(identical, list(findInterval.rightClosed(x, vec)),
  list(findInterval.rightClosed2(x, vec), cutFun(x, cutVec), fiFun(x, vec)))
# [1] TRUE TRUE TRUE

Now a more demanding vector to bin:

x <- rpois(2e6, 10)
vec <- c(-Inf, quantile(x, seq(.2, 1, .2)))

Test whether identical (note use of unname)

mapply(identical, list(unname(findInterval.rightClosed(x, vec))),
  list(findInterval.rightClosed2(x, vec), cutFun(x, vec), fiFun(x, vec)))
# [1] TRUE TRUE TRUE

And benchmark:

library(microbenchmark)
microbenchmark(findInterval.rightClosed(x, vec), findInterval.rightClosed2(x, vec),
  cutFun(x, vec), fiFun(x, vec), times = 50)
# Unit: milliseconds
#                                expr       min        lq    median        uq       max
# 1                    cutFun(x, vec)  35.46261  35.63435  35.81233  36.68036  53.52078
# 2                     fiFun(x, vec)  51.30158  51.69391  52.24277  53.69253  67.09433
# 3  findInterval.rightClosed(x, vec) 124.57110 133.99315 142.06567 155.68592 176.43291
# 4 findInterval.rightClosed2(x, vec)  79.81685  82.01025  86.20182  95.65368 108.51624

From this run, cut seems to be the fastest.

Benner answered 20/11, 2012 at 22:3 Comment(10)
Thanks. I seem to recall that cut is less efficient, and it also doesn't allow trailing off the right edge of vec like findInterval does (see rows 15-17 in your output), but that part could be worked around by adding an Inf at the end.Sleeve
@KenWilliams, yeah, covering the whole range of x with breaks allows you to cut up all of x (see my edit, too). As to the efficiency, well, the code already exists at least.Benner
@KenWilliams, if you count speed in efficiency benchmarking, there might not be all that much of a difference between findInterval and cut after all. Perhaps the slowdown in cut usually comes from conversion to a factor and label-making?Benner
For completeness, could you include the OP's solutions as well in your timings...?Hoberthobey
@KenWilliams, if you wish, please see the major update above with (speed) benchmarking.Benner
Surprising results! I get similar numbers on my machine. It looks like I should switch my code to cutFun. And maybe the docs for cut need to be updated to remove the disclaimer about speed.Sleeve
@KenWilliams, I was surprised, too. And I just found news(grepl("^PERF", Category) & grepl("cut",Text)), so the docs very well might be out of date.Benner
One subtle difference to note: findInterval only requires that the cutpoints are non-decreasing, but cut requires that they're strictly increasing. Working around that would probably make cutFun lose the speed battle.Sleeve
@eddi, could you provide an example of fiFun failing? The couple of examples I ran seemed to return expected results.Benner
I still prefer OP's solutions because of e.g. this thread: #16185447Scalar
R
4

Maybe you can use the option left.open:

findInterval(x, vec, left.open=T)
[1] 1 2 3 3 3 4 4
Runofthemill answered 23/10, 2017 at 16:33 Comment(1)
Yes - this option was added in R version 3.3.2, released in October 2016. It's now the right way to do this.Sleeve
B
-1

If your limits are intervals you simply can grow the right interval a bit: interval+c(0,0.1) would do: findinterval(value, interval+c(0,0.1))

Bluenose answered 15/8, 2016 at 14:38 Comment(1)
This won't work for a variety of reasons. Most fundamentally, recycling will cause this to change the right interval on every odd entry, not all of them. Further, this assumes that you know that the discretization of the data is wider than 0.1.Draftee

© 2022 - 2024 — McMap. All rights reserved.