Fastest way to find second (third...) highest/lowest value in vector or column
Asked Answered
N

16

195

R offers max and min, but I do not see a really fast way to find another value in the order, apart from sorting the whole vector and then picking a value x from this vector.

Is there a faster way to get the second highest value, for example?

Neuron answered 16/3, 2010 at 9:54 Comment(4)
The package kit on CRAN has a topn function which is faster than sort, order and nth. Look at the documentation.Englacial
@Englacial could you provide examples benchmarking it against the examples provided by Rfast::nth? If it realy is faster when fairly compared to Rfast::nth then it should be the accepted answerDictatorship
@Stefanos, I posted the benchmark below ...based on your benchmarkEnglacial
I just did a second run with kit::topn(hasna=F)...I believe I provided the best answer now, didn't I?Englacial
D
47

Rfast has a function called nth_element that does exactly what you ask.

Further the methods discussed above that are based on partial sort, don't support finding the k smallest values

Update (28/FEB/21) package kit offers a faster implementation (topn) see https://mcmap.net/q/128031/-fastest-way-to-find-second-third-highest-lowest-value-in-vector-or-column, https://mcmap.net/q/130112/-fastest-way-to-find-the-index-of-the-second-third-highest-lowest-value-in-vector-or-column

Disclaimer: An issue appears to occur when dealing with integers which can by bypassed by using as.numeric (e.g. Rfast::nth(as.numeric(1:10), 2)), and will be addressed in the next update of Rfast.

Rfast::nth(x, 5, descending = T)

Will return the 5th largest element of x, while

Rfast::nth(x, 5, descending = F)

Will return the 5th smallest element of x

Benchmarks below against most popular answers.

For 10 thousand numbers:

N = 10000
x = rnorm(N)

maxN <- function(x, N=2){
    len <- length(x)
    if(N>len){
        warning('N greater than length(x).  Setting N=length(x)')
        N <- length(x)
    }
    sort(x,partial=len-N+1)[len-N+1]
}

microbenchmark::microbenchmark(
Rfast = Rfast::nth(x,5,descending = T),
maxn = maxN(x,5),
order = x[order(x, decreasing = T)[5]])

Unit: microseconds
  expr      min       lq      mean   median        uq       max neval
 Rfast  160.364  179.607  202.8024  194.575  210.1830   351.517   100
  maxN  396.419  423.360  559.2707  446.452  487.0775  4949.452   100
 order 1288.466 1343.417 1746.7627 1433.221 1500.7865 13768.148   100

For 1 million numbers:

N = 1e6
x = rnorm(N)

microbenchmark::microbenchmark(
Rfast = Rfast::nth(x,5,descending = T),
maxN = maxN(x,5),
order = x[order(x, decreasing = T)[5]]) 

Unit: milliseconds
  expr      min        lq      mean   median        uq       max neval
 Rfast  89.7722  93.63674  114.9893 104.6325  120.5767  204.8839   100
  maxN 150.2822 207.03922  235.3037 241.7604  259.7476  336.7051   100
 order 930.8924 968.54785 1005.5487 991.7995 1031.0290 1164.9129   100
Dictatorship answered 4/11, 2018 at 19:48 Comment(5)
Nice! Normally when I see a relatively low-rep user add an answer to a popular old question it's pretty low-quality. This, on the other hand, is an excellent addition. I made a couple readability edits, but it looks great!Trotman
It bears mentioning that Rfast::nth can return multiple elements (e.g. 8th and 9th largest elements) as well as the indices of those elements.Burnsed
What I like about the Rfast solution is that the package also has an easily implemented solution for doing this for each row or column.Karakorum
There is a bug in nth for integer values. I know it and I will fix it for future update of package. For now you can just use Rfast::nth(as.numeric(1:10), 2). Although, I don't really think that Rfast::nth(1:10, 2) is a great example. If you have a sorted array why do you want to use nth? It is a lot faster to check if it is sorted and then extract the value or even better extract the value itself.Asserted
This solution was the best for my needs, but I'm wondering if there's a way to return NA if there isn't a 3rd or 4th highest value. I have a table where that's the case.Pisciform
I
207

Use the partial argument of sort(). For the second highest value:

n <- length(x)
sort(x,partial=n-1)[n-1]
Iodize answered 16/3, 2010 at 10:41 Comment(6)
What is the advantage of this method as opposed to sort(x, TRUE)[2] as described in @Abrar's answer, apart from not satisfying the constraint in the question?Lengel
I used this method, but get the following error: Error in sort.int(x, na.last = na.last, decreasing = decreasing, ...) : index 4705 outside bounds Any idea what might the issue be? Some details: My x is a numeric vector of length 4706 with some NAs in the data. I tried to get the second highest value in the vector using the exact same code as @RobHyndman suggested.Gifford
Why don't you sort descending and take the second of only two values? Wouldn't this be faster?Koloski
The descreasing argument is not compatible with partial sorting.Iodize
@Gifford I know you asked this 3.5 years ago, but this solution won't work with missing values because sort removes the missing values. One fix would be n <- sum(!is.na(x)); sort(x,partial=n-1)[n-1]Etienne
Though the decreasing argument is not compatible with partial sorting, you could always -sort(-x, partial=n-1)[n-1]; it is logically the same thing and takes considerably less time than sort(x, decreasing=TRUE)[n-1].Sacramentarian
A
65

Slightly slower alternative, just for the records:

x <- c(12.45,34,4,0,-234,45.6,4)
max( x[x!=max(x)] )
min( x[x!=min(x)] )
Autry answered 16/3, 2010 at 11:49 Comment(4)
It would seem surprising if this was any faster than sorting the whole vector and taking the n-1th value!Koloski
@Koloski This is O(n) so it has to be faster than sorting on large datasets.Habitant
It seems to me you can get some considerable speed improvement with a small modification: max(x[-which.max(x)])Crock
This answer produces an error if all values are the same, unless you use @sindri_baldur's answer (and there are at least 2 items, of course)Xerophthalmia
D
47

Rfast has a function called nth_element that does exactly what you ask.

Further the methods discussed above that are based on partial sort, don't support finding the k smallest values

Update (28/FEB/21) package kit offers a faster implementation (topn) see https://mcmap.net/q/128031/-fastest-way-to-find-second-third-highest-lowest-value-in-vector-or-column, https://mcmap.net/q/130112/-fastest-way-to-find-the-index-of-the-second-third-highest-lowest-value-in-vector-or-column

Disclaimer: An issue appears to occur when dealing with integers which can by bypassed by using as.numeric (e.g. Rfast::nth(as.numeric(1:10), 2)), and will be addressed in the next update of Rfast.

Rfast::nth(x, 5, descending = T)

Will return the 5th largest element of x, while

Rfast::nth(x, 5, descending = F)

Will return the 5th smallest element of x

Benchmarks below against most popular answers.

For 10 thousand numbers:

N = 10000
x = rnorm(N)

maxN <- function(x, N=2){
    len <- length(x)
    if(N>len){
        warning('N greater than length(x).  Setting N=length(x)')
        N <- length(x)
    }
    sort(x,partial=len-N+1)[len-N+1]
}

microbenchmark::microbenchmark(
Rfast = Rfast::nth(x,5,descending = T),
maxn = maxN(x,5),
order = x[order(x, decreasing = T)[5]])

Unit: microseconds
  expr      min       lq      mean   median        uq       max neval
 Rfast  160.364  179.607  202.8024  194.575  210.1830   351.517   100
  maxN  396.419  423.360  559.2707  446.452  487.0775  4949.452   100
 order 1288.466 1343.417 1746.7627 1433.221 1500.7865 13768.148   100

For 1 million numbers:

N = 1e6
x = rnorm(N)

microbenchmark::microbenchmark(
Rfast = Rfast::nth(x,5,descending = T),
maxN = maxN(x,5),
order = x[order(x, decreasing = T)[5]]) 

Unit: milliseconds
  expr      min        lq      mean   median        uq       max neval
 Rfast  89.7722  93.63674  114.9893 104.6325  120.5767  204.8839   100
  maxN 150.2822 207.03922  235.3037 241.7604  259.7476  336.7051   100
 order 930.8924 968.54785 1005.5487 991.7995 1031.0290 1164.9129   100
Dictatorship answered 4/11, 2018 at 19:48 Comment(5)
Nice! Normally when I see a relatively low-rep user add an answer to a popular old question it's pretty low-quality. This, on the other hand, is an excellent addition. I made a couple readability edits, but it looks great!Trotman
It bears mentioning that Rfast::nth can return multiple elements (e.g. 8th and 9th largest elements) as well as the indices of those elements.Burnsed
What I like about the Rfast solution is that the package also has an easily implemented solution for doing this for each row or column.Karakorum
There is a bug in nth for integer values. I know it and I will fix it for future update of package. For now you can just use Rfast::nth(as.numeric(1:10), 2). Although, I don't really think that Rfast::nth(1:10, 2) is a great example. If you have a sorted array why do you want to use nth? It is a lot faster to check if it is sorted and then extract the value or even better extract the value itself.Asserted
This solution was the best for my needs, but I'm wondering if there's a way to return NA if there isn't a 3rd or 4th highest value. I have a table where that's the case.Pisciform
P
31

I wrapped Rob's answer up into a slightly more general function, which can be used to find the 2nd, 3rd, 4th (etc.) max:

maxN <- function(x, N=2){
  len <- length(x)
  if(N>len){
    warning('N greater than length(x).  Setting N=length(x)')
    N <- length(x)
  }
  sort(x,partial=len-N+1)[len-N+1]
}

maxN(1:10)
Phototype answered 8/1, 2014 at 19:47 Comment(2)
Cool. This usage is particularly useful maxN(1:10, 1:3) (I would have set the default N to 1)Aztec
Why not have the main line in the fx as sort(x, decreasing=T, partial=N)[N]?Nickname
T
16

Here is an easy way to find the indices of N smallest/largest values in a vector(Example for N = 3):

N <- 3

N Smallest:

ndx <- order(x)[1:N]

N Largest:

ndx <- order(x, decreasing = T)[1:N]

So you can extract the values as:

x[ndx]
Tangible answered 26/9, 2013 at 14:52 Comment(3)
This runs in L log L time, where L is the length of x. I think the user was hoping for a method that runs in log L time.Chaves
This might be the second fastest way if the methods were ordered by time and the fastest N extracted. I also like it because it is very clear code compared to the accepted solution.Bloxberg
The theoretical best and the accepted method (hopefully) runs in O(L) time, not O(log L). This one runs in O(L log L).Cas
A
11

For nth highest value,

sort(x, TRUE)[n]
Amador answered 15/12, 2011 at 7:13 Comment(2)
The OP already said in his post that this was a solution he did not want to use: "apart from sorting the whole vector and than picking value x from this vector".Yearly
Handy as one can easily grab the three (four, whatever) highest sort(x, TRUE)[1:3]Skiest
E
9

Here you go... kit is the obvious winner!

N = 1e6
x = rnorm(N)

maxN <- function(x, N=2){
  len <- length(x)
  if(N>len){
    warning('N greater than length(x).  Setting N=length(x)')
    N <- length(x)
  }
  sort(x,partial=len-N+1)[len-N+1]
}

microbenchmark::microbenchmark(
  Rfast = Rfast::nth(x,5,descending = T),
  maxN = maxN(x,5),
  order = x[order(x, decreasing = T)[5]],
  kit = x[kit::topn(x, 5L,decreasing = T)[5L]]
) 
# Unit: milliseconds
# expr       min        lq     mean    median        uq        max neval
# Rfast 12.311168 12.473771 16.36982 12.702134 16.110779 102.749873   100
# maxN  12.922118 13.124358 17.49628 18.977537 20.053139  28.928694   100
# order 50.443100 50.926975 52.54067 51.270163 52.323116  66.561606   100
# kit    1.177202  1.216371  1.29542  1.240228  1.297286   2.771715   100

Edit: I forgot that kit::topn has hasna option...let's do another run.

microbenchmark::microbenchmark(
  Rfast = Rfast::nth(x,5,descending = T),
  maxN = maxN(x,5),
  order = x[order(x, decreasing = T)[5]],
  kit = x[kit::topn(x, 5L,decreasing = T)[5L]],
  kit2 = x[kit::topn(x, 5L,decreasing = T,hasna = F)[5L]],
  unit = "ms"
) 
# Unit: milliseconds
# expr       min        lq       mean     median        uq       max neval
# Rfast 13.194314 13.358787 14.7227116 13.4560340 14.551194 24.524105   100
# maxN   7.378960  7.527661 10.0747803  7.7119715 12.217756 67.409526   100
# order 50.088927 50.488832 52.4714347 50.7415680 52.267003 70.062662   100
# kit    1.180698  1.217237  1.2975441  1.2429790  1.278243  3.263202   100
# kit2   0.842354  0.876329  0.9398055  0.9109095  0.944407  2.135903   100
Englacial answered 25/2, 2021 at 11:55 Comment(2)
You can add index=F to kit::topn to return values instead of indexes: topn(x,5,hasna=F,index=F)[5]. And when I added x2=x;for(i in 1:4)x2[which.max(x2)]=NA;max(x2,na.rm=T) to your benchmark, it was about 50% faster than maxN and about 2.5 times faster than order.Mcdaniels
This should be the accepted answer. kit is blazingly fast and you can return either the actual values of the vector or the indices. Really great solution.Cyclosis
E
5

Here is the simplest way I found,

num <- c(5665,1615,5154,65564,69895646)

num <- sort(num, decreasing = F)

tail(num, 1)                           # Highest number
head(tail(num, 2),1)                   # Second Highest number
head(tail(num, 3),1)                   # Third Highest number
head(tail(num, n),1)                   # Generl equation for finding nth Highest number
Emptor answered 8/1, 2020 at 3:58 Comment(0)
T
3

I found that removing the max element first and then do another max runs in comparable speed:

system.time({a=runif(1000000);m=max(a);i=which.max(a);b=a[-i];max(b)})
   user  system elapsed 
  0.092   0.000   0.659 

system.time({a=runif(1000000);n=length(a);sort(a,partial=n-1)[n-1]})
   user  system elapsed 
  0.096   0.000   0.653 
Tusker answered 23/10, 2013 at 19:3 Comment(0)
M
2

dplyr has the function nth, where the first argument is the vector and the second is which place you want. This goes for repeating elements as well. For example:

x = c(1,2, 8, 16, 17, 20, 1, 20)

Finding the second largest value:

 nth(unique(x),length(unique(x))-1)

[1] 17
Maverick answered 8/2, 2018 at 14:51 Comment(4)
is this fast ... ?Melonie
internally this uses x[[order(order_by)[[n]]]] - so it requires sorting the whole vector. So it won't be as fast as the accepted answer.Melonie
but it uses sort with the partial= argument (which changes everything)Melonie
@BenBolker which implies Paolo's or Rob's answer could be used to improve dplyr::nth()? bench::mark(max(x[-which.max(x)]), x[[order(-x)[[2]]]] ), nth()seems almost 10 times slower, where length(x) is 3 million.Crock
R
1

When I was recently looking for an R function returning indexes of top N max/min numbers in a given vector, I was surprised there is no such a function.

And this is something very similar.

The brute force solution using base::order function seems to be the easiest one.

topMaxUsingFullSort <- function(x, N) {
  sort(x, decreasing = TRUE)[1:min(N, length(x))]
}

But it is not the fastest one in case your N value is relatively small compared to length of the vector x.

On the other side if the N is really small, you can use base::whichMax function iteratively and in each iteration you can replace found value by -Inf

# the input vector 'x' must not contain -Inf value 
topMaxUsingWhichMax <- function(x, N) {
  vals <- c()
  for(i in 1:min(N, length(x))) {
    idx      <- which.max(x)
    vals     <- c(vals, x[idx]) # copy-on-modify (this is not an issue because idxs is relative small vector)
    x[idx]   <- -Inf            # copy-on-modify (this is the issue because data vector could be huge)
  }
  vals
}

I believe you see the problem - the copy-on-modify nature of R. So this will perform better for very very very small N (1,2,3) but it will rapidly slow down for larger N values. And you are iterating over all elements in vector x N times.

I think the best solution in clean R is to use partial base::sort.

topMaxUsingPartialSort <- function(x, N) {
  N <- min(N, length(x))
  x[x >= -sort(-x, partial=N)[N]][1:N]
}

Then you can select the last (Nth) item from the result of functions defiend above.

Note: functions defined above are just examples - if you want to use them, you have to check/sanity inputs (eg. N > length(x)).

I wrote a small article about something very similar (get indexes of top N max/min values of a vector) at http://palusga.cz/?p=18 - you can find here some benchmarks of similar functions I defined above.

Retired answered 5/2, 2015 at 23:48 Comment(0)
U
1

head(sort(x),..) or tail(sort(x),...) should work

Undertrick answered 17/3, 2015 at 16:30 Comment(0)
C
1

This will find the index of the N'th smallest or largest value in the input numeric vector x. Set bottom=TRUE in the arguments if you want the N'th from the bottom, or bottom=FALSE if you want the N'th from the top. N=1 and bottom=TRUE is equivalent to which.min, N=1 and bottom=FALSE is equivalent to which.max.

FindIndicesBottomTopN <- function(x=c(4,-2,5,-77,99),N=1,bottom=FALSE)
{

  k1 <- rank(x)
  if(bottom==TRUE){
    Nindex <- which(k1==N)
    Nindex <- Nindex[1]
  }

  if(bottom==FALSE){
    Nindex <- which(k1==(length(x)+1-N))
    Nindex <- Nindex[1]
  }

  return(Nindex)
}
Caldarium answered 31/5, 2017 at 14:4 Comment(0)
F
0
topn = function(vector, n){
  maxs=c()
  ind=c()
  for (i in 1:n){
    biggest=match(max(vector), vector)
    ind[i]=biggest
    maxs[i]=max(vector)
    vector=vector[-biggest]
  }
  mat=cbind(maxs, ind)
  return(mat)
}

this function will return a matrix with the top n values and their indices. hope it helps VDevi-Chou

Fagaly answered 6/6, 2016 at 13:41 Comment(0)
M
-1

You can identify the next higher value with cummax(). If you want the location of the each new higher value for example you can pass your vector of cummax() values to the diff() function to identify locations at which the cummax() value changed. say we have the vector

v <- c(4,6,3,2,-5,6,8,12,16)
cummax(v) will give us the vector
4  6  6  6  6  6  8 12 16

Now, if you want to find the location of a change in cummax() you have many options I tend to use sign(diff(cummax(v))). You have to adjust for the lost first element because of diff(). The complete code for vector v would be:

which(sign(diff(cummax(v)))==1)+1
Mulford answered 18/3, 2016 at 16:11 Comment(1)
I think you misunderstand the question. The goal is to find, say, the second highest value. How does this help to get you from v to 12... and for the third highest to 8?Ashliashlie
N
-1

You can use the sort keyword like this:

sort(unique(c))[1:N]

Example:

c <- c(4,2,44,2,1,45,34,2,4,22,244)
sort(unique(c), decreasing = TRUE)[1:5]

will give the first 5 max numbers.

Notification answered 14/6, 2016 at 10:16 Comment(0)

© 2022 - 2024 — McMap. All rights reserved.