Why sort is slower than order function in R?
Asked Answered
V

2

5

All is in the title. I would expect that order uses sort to find the order of the values in a vector. Thus sort should be quicker than order to sort a vector, but this is not the case:

library(microbenchmark)
ss=sample(100,10000,replace=T)
microbenchmark(sort(ss))
microbenchmark(ss[order(ss)])

result:

> microbenchmark(sort(ss))
Unit: microseconds
    expr     min       lq     mean  median       uq      max neval
 sort(ss) 141.535 144.6415 173.6581 146.358 150.2295 2531.762   100
> microbenchmark(ss[order(ss)])
Unit: microseconds
        expr     min       lq     mean  median       uq     max neval
 ss[order(ss)] 109.198 110.9865 115.6275 111.901 115.3655 197.204   100

Example with a larger vector:

ss=sample(100,1e8,replace=T)
microbenchmark(sort(ss), ss[order(ss)], times = 5)
# Unit: seconds
#           expr      min       lq     mean   median       uq      max neval
#       sort(ss) 5.427966 5.431971 5.892629 6.049515 6.207060 6.346633     5
#  ss[order(ss)] 3.381253 3.500134 3.562048 3.518079 3.625778 3.784997     5
Vivienviviene answered 20/6, 2018 at 17:58 Comment(3)
note that your timings are on the order of microseconds. please set the size of the sample to at least 1 million (preferably 10 million or 100 million) and update the timingsTrenttrento
@Trenttrento both of these functions use (the same) O(n) algorithm for integer vectors, so you're not going to see much more separation with larger inputs.Organza
Thank @Ryan for the update with larger example. Is clear then than sort keep being aroun 1.6 times slower that order. But as suggested by @Dan Hall in his answer, this ratio shrink to 1.15 when using directly sort.int.Vivienviviene
R
5

The treatment of NA values under the default arguments is different. In sort, the entire vector must be scanned for NA values, which are then removed; in order, they are simply put last. When the argument sort.last = TRUE is used in both, the performance is basically identical.

ss=sample(100,1e8,replace=T) 
bench::mark(sort(ss), ss[order(ss)], sort(ss, na.last = TRUE))
# A tibble: 3 x 14
  expression    min   mean median    max `itr/sec` mem_alloc  n_gc n_itr total_time result
  <chr>      <bch:> <bch:> <bch:> <bch:>     <dbl> <bch:byt> <dbl> <int>   <bch:tm> <list>
1 sort(ss)   2.610s 2.610s 2.610s 2.610s     0.383 762.940MB     0     1     2.610s <int ~
2 ss[order(~ 1.597s 1.597s 1.597s 1.597s     0.626 762.940MB     0     1     1.597s <int ~
3 sort(ss, ~ 1.592s 1.592s 1.592s 1.592s     0.628 762.940MB     0     1     1.592s <int ~
# ... with 3 more variables: memory <list>, time <list>, gc <list>
Reitman answered 16/8, 2019 at 15:37 Comment(2)
This is an excellent answer and should have the green checkmark. If I could take it away from mine and give it to you I would. Thanks for coming back to this a year later.Organza
I just saw the answer and changed the accepted answer. Thanks to both of you!Vivienviviene
O
5

because sort.default() uses order (rather than the other way around).

function (x, decreasing = FALSE, na.last = NA, ...) 
{
  if (is.object(x)) 
    x[order(x, na.last = na.last, decreasing = decreasing)]
  else sort.int(x, na.last = na.last, decreasing = decreasing, 
    ...)
}

sort has to determine its method, then execute the same x[order(x)] call you're executing in one step when you use x[order(x)] directly. You can ramp up the size of the input as much as you want. For an integer vector, x[order(x)] should always outperform sort(x).

@Hugh's answer a year later demonstrates that the bulk of the difference is in the default treatment of NA values. It should be the accepted answer here.

Organza answered 20/6, 2018 at 18:12 Comment(4)
just want to be clear I wasn't saying the timing would flip, just that it would not look "strikingly" different once the timings see actually meaningful.Trenttrento
This certainly proves that sort won’t be generally faster than order but it seems incomplete. Determining the method and executing is.object should only take microseconds and shouldn’t be too dependent on the length of x.Reitman
@Reitman class determination and method selection end up being a significant part of the computational cost throughout R. One of the first steps to take when improving efficiency in R is to select a method directly.Organza
I don't deny it's significant, but it's the minority of the difference: sort 2.7s, sort.int: 2.6s, x[order(x)]: 1.5s. Something else is going on.Reitman
R
5

The treatment of NA values under the default arguments is different. In sort, the entire vector must be scanned for NA values, which are then removed; in order, they are simply put last. When the argument sort.last = TRUE is used in both, the performance is basically identical.

ss=sample(100,1e8,replace=T) 
bench::mark(sort(ss), ss[order(ss)], sort(ss, na.last = TRUE))
# A tibble: 3 x 14
  expression    min   mean median    max `itr/sec` mem_alloc  n_gc n_itr total_time result
  <chr>      <bch:> <bch:> <bch:> <bch:>     <dbl> <bch:byt> <dbl> <int>   <bch:tm> <list>
1 sort(ss)   2.610s 2.610s 2.610s 2.610s     0.383 762.940MB     0     1     2.610s <int ~
2 ss[order(~ 1.597s 1.597s 1.597s 1.597s     0.626 762.940MB     0     1     1.597s <int ~
3 sort(ss, ~ 1.592s 1.592s 1.592s 1.592s     0.628 762.940MB     0     1     1.592s <int ~
# ... with 3 more variables: memory <list>, time <list>, gc <list>
Reitman answered 16/8, 2019 at 15:37 Comment(2)
This is an excellent answer and should have the green checkmark. If I could take it away from mine and give it to you I would. Thanks for coming back to this a year later.Organza
I just saw the answer and changed the accepted answer. Thanks to both of you!Vivienviviene

© 2022 - 2024 — McMap. All rights reserved.