Why is the range() function slower than a combination of min and max?
I came across R's range function. It is certainly a useful tool and makes code more readable, but it can be made roughly twice as fast by replacing it with a simple one-liner built from min and max.

I did some benchmarks, and the poor performance of the range function surprised me. For comparison I wrote a function called range2 which uses min and max (see code). Speed aside, is there any reason this function exists if it can be outperformed by a simple, equally readable one-liner?

require(microbenchmark)

range2 <- function(x) c(min(x), max(x))

n <- 1000000
x <- rnorm(n)
microbenchmark(range(x), range2(x))
#Unit: milliseconds
#  expr      min       lq     mean   median       uq     max neval cld
# range(x) 4.696101 4.734751 5.321603 4.796301 4.814751 23.0646   100   b
#range2(x) 2.477602 2.516101 2.542540 2.535051 2.544052  3.7636   100  a 

n <- 10000000
x <- rnorm(n)
microbenchmark(range(x), range2(x))
# Unit: milliseconds
#  expr     min      lq     mean   median       uq      max neval cld
# range(x) 47.3246 47.9498 58.27992 55.25795 61.98205 146.5100   100   b
#range2(x) 24.7063 25.5021 25.59192 25.55245 25.63515  27.1088   100  a

Admittedly, this would not be the first bottleneck one would want to eliminate, since we are talking about milliseconds on a vector with 10,000,000 entries, but I expected range to be faster. My simple intuition was:

range goes through the data in a single pass, searching for the minimum and maximum at the same time, whereas my range2 function goes through the data twice: once to find the minimum and once to find the maximum.

Perhaps someone can give some background on the implementation. Could the reason be that min and max are implemented in C while range is not?
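One way to check this guess from the R console (a small sketch, relying only on the base function is.primitive): min and max are primitives, i.e. they dispatch straight into C code, while range is an ordinary R closure that wraps them with extra logic.

```r
# min and max are primitives (implemented directly in C):
is.primitive(min)    # TRUE
is.primitive(max)    # TRUE

# range is an ordinary R closure, so its extra checks run as interpreted R:
is.primitive(range)  # FALSE
```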

Addition: I've discussed this with a friend, and he made the function faster still by implementing it in C++ via Rcpp:

#include <Rcpp.h>
#include <float.h>
using namespace Rcpp;
// [[Rcpp::export]]
NumericVector range3(NumericVector x) {
  int xs = x.size();
  double minValue = DBL_MAX;
  double maxValue = -DBL_MAX;  // note: FLT_MIN would be wrong here -- it is the
                               // smallest positive float, not a large negative value
  for (int i = 0; i < xs; i++) {
    if (x[i] < minValue) minValue = x[i];
    if (x[i] > maxValue) maxValue = x[i];
  }
  Rcpp::NumericVector result(2);
  result[0] = minValue;
  result[1] = maxValue;
  return result;
}

and this gives the following benchmarks:

n <- 10000000
x <- rnorm(n)
microbenchmark(range(x), range2(x), range3(x))
#Unit: milliseconds
#      expr     min       lq     mean  median       uq      max neval cld
#  range(x) 47.8583 48.30355 58.12575 55.3135 62.10295 149.9648   100   c
# range2(x) 24.8211 25.53615 25.90920 25.6176 25.79175  42.4659   100  b 
# range3(x) 13.2458 13.30385 13.47175 13.3797 13.65410  14.3487   100 a
Vegetarianism answered 3/4, 2019 at 15:5
You can see from range.default that it does a few extra checks before calling c(min(x), max(x)) itself. It's not optimized for speed. It's just a user-friendly function. It seems unlikely that those millisecond differences would be the source of a performance bottleneck. – Bowsprit
Hi, here is how you can view the source code of range.default: getAnywhere(range.default) – Auld
The function definition should also display if you just run the line range.default – Brassy
Thank you for the answers. This is a nice example of how some knowledge about the data you are working with is helpful, since you can avoid checks that are not needed. – Vegetarianism
I have no idea about R, but a wild guess: the internal x[i] < minValue and x[i] > maxValue checks will yield true only for the first few data elements. After that, only "outliers" will yield true. So I would be interested to see a comparison of these functions on data that is 1. constant, 2. ascending, 3. descending (yeah, we can hear the words "branch prediction fail" sneaking in here...) – Lavenialaver
@Bowsprit please put your comment into an answer; this is a highly voted but still unanswered question. It would be worth having it marked as resolved. – Denticulation
Here's the source for range.default (R 3.6.1):

 > range.default
function (..., na.rm = FALSE, finite = FALSE) 
{
    x <- c(..., recursive = TRUE)
    if (is.numeric(x)) {
        if (finite) 
            x <- x[is.finite(x)]
        else if (na.rm) 
            x <- x[!is.na(x)]
        c(min(x), max(x))
    }
    else {
        if (finite) 
            na.rm <- TRUE
        c(min(x, na.rm = na.rm), max(x, na.rm = na.rm))
    }
}

You can see that it does a few extra checks before calling c(min(x), max(x)) itself. It's not optimized for speed. It's just a user-friendly function. It seems unlikely that those millisecond differences would be the source of a performance bottleneck.
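To illustrate the point above (a small sketch with made-up sample data): for a plain numeric vector with no NAs or infinities, range takes exactly the c(min(x), max(x)) branch of the source shown, so the one-liner returns an identical result and the extra checks buy nothing in that case.

```r
# For clean numeric data, range() and the min/max one-liner agree exactly:
x <- c(3.2, -1.5, 0.7)
identical(range(x), c(min(x), max(x)))  # TRUE

# The checks matter only when the data needs them, e.g. with NAs:
y <- c(1, NA, 5)
range(y, na.rm = TRUE)                  # c(1, 5)
```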

Bowsprit answered 2/3, 2020 at 5:3
