Rank items in an array using Python/NumPy, without sorting array twice

Asked 12/3, 2011 at 18:52 Answered 22/1, 2020 at 16:16

160

I have an array of numbers and I'd like to create another array that represents the rank of each item in the first array. I'm using Python and NumPy.

For example:

array = [4,2,7,1]
ranks = [2,1,3,0]

Here's the best method I've come up with:

array = numpy.array([4,2,7,1])
temp = array.argsort()
ranks = numpy.arange(len(array))[temp.argsort()]

Are there any better/faster methods that avoid sorting the array twice?

Syringa answered 12/3, 2011 at 18:52 Comment(2)

Your last line is equivalent to ranks = temp.argsort(). – Unisexual 12/3, 2011 at 19:3

This method doesn't work when there is a tie in data – Teteak 8/11, 2022 at 19:8

107

Use advanced indexing on the left-hand side in the last step:

array = numpy.array([4,2,7,1])
temp = array.argsort()
ranks = numpy.empty_like(temp)
ranks[temp] = numpy.arange(len(array))

This avoids sorting twice by inverting the permutation in the last step.

Unisexual answered 12/3, 2011 at 19:1 Comment(6)

Perfect, thank you! I knew there was a solution and it would seem obvious once I saw it. I did some testing with timeit, and this method is slightly slower for small arrays. On my machine they're equal when the array has 2,000 elements. At 20,000 elements, your method is about 25% faster. – Syringa 12/3, 2011 at 20:33

any recommendation on how to do this rowwise? – Sharpedged 19/11, 2017 at 14:9

For more than 1 dim see answer below. – Iddo 22/1, 2020 at 16:17

Just a minor correction, the indirection ranks[temp] isn't technically slicing, but indexing (see numpy.org/doc/stable/reference/arrays.indexing.html) – Cryotherapy 19/1, 2021 at 17:38

Looks a bit like the implementation of scipy.stats.rankdata (for ordinal rank). – Submerge 16/6, 2021 at 17:14

This method doesn't work when there is a tie in data – Teteak 8/11, 2022 at 19:10

145

Use argsort twice, first to obtain the order of the array, then to obtain ranking:

array = numpy.array([4,2,7,1])
order = array.argsort()
ranks = order.argsort()

When dealing with 2D (or higher dimensional) arrays, be sure to pass an axis argument to argsort to order over the correct axis.

Specialist answered 7/6, 2011 at 14:12 Comment(10)

Note that if numbers are repeated in your input array (eg. [4,2,7,1,1]) the output will rank those numbers based on their array position ([3,2,4,0,1]) – Revert 8/6, 2011 at 1:53

Sorting twice is inefficient. @Sven Marnach's answer shows how to accomplish the ranking with a single call to argsort. – Wrath 13/7, 2015 at 15:13

@WarrenWeckesser: I just tested the difference between the two, and you're right for large arrays, but for anything smaller (n < 100), double argsort is faster (about 20% faster for n=100, and about 5 times faster for n=10). So if you have to do a lot of rankings across lots of small sets of values, this method is much better. – Desex 18/8, 2015 at 2:49

That's a good point--I see roughly the same effect. For large enough input, the improved efficiency wins, but for small inputs, the double argsort apparently has less overhead. – Wrath 18/8, 2015 at 3:3

@WarrenWeckesser: Actually, I'm wrong, this method is hands-down better. Both methods are far faster than the scipy.stats method, too. Results: gist.github.com/naught101/14042d91a2d0f18a6ae4 – Desex 18/8, 2015 at 3:3

@naught101: There is a bug in your script. The line array = np.random.rand(10) should be array = np.random.rand(n). – Wrath 18/8, 2015 at 3:25

Wouldn't the following also be equivalent without having to sort twice? array = numpy.array([4,2,7,1]); order = array.argsort(); ranks = order[order] – Boride 18/12, 2015 at 17:51

I couldn't find an explanation as to why double argsort works - I hope this helps berkayantmen.com/rank – Orbadiah 20/10, 2019 at 21:57

Doesn't work on an input with repeated elements. – Inquietude 9/1, 2022 at 15:42

This method doesn't work when there is a tie in data – Teteak 8/11, 2022 at 19:8

132

This question is a few years old, and the accepted answer is great, but I think the following is still worth mentioning. If you don't mind the dependence on scipy, you can use scipy.stats.rankdata:

In [22]: from scipy.stats import rankdata

In [23]: a = [4, 2, 7, 1]

In [24]: rankdata(a)
Out[24]: array([ 3.,  2.,  4.,  1.])

In [25]: (rankdata(a) - 1).astype(int)
Out[25]: array([2, 1, 3, 0])

A nice feature of rankdata is that the method argument provides several options for handling ties. For example, there are three occurrences of 20 and two occurrences of 40 in b:

In [26]: b = [40, 20, 70, 10, 20, 50, 30, 40, 20]

The default assigns the average rank to the tied values:

In [27]: rankdata(b)
Out[27]: array([ 6.5,  3. ,  9. ,  1. ,  3. ,  8. ,  5. ,  6.5,  3. ])

method='ordinal' assigns consecutive ranks:

In [28]: rankdata(b, method='ordinal')
Out[28]: array([6, 2, 9, 1, 3, 8, 5, 7, 4])

method='min' assigns the minimum rank of the tied values to all the tied values:

In [29]: rankdata(b, method='min')
Out[29]: array([6, 2, 9, 1, 2, 8, 5, 6, 2])

See the docstring for more options.

Wrath answered 15/3, 2015 at 11:15 Comment(4)

yeah, this is the best answer anywhere where edge cases are important. – Desex 20/10, 2015 at 2:54

I find it interesting that rankdata seems to use the same mechanism as the accepted answer to generate the initial ranking internally. – Iqbal 17/5, 2019 at 15:35

This is a great answer, it also works nice for my pandas DataFrames where I want to rank elements in each row of a 2D DataFrame but specifying axis=0 in rankdata() and handle ties at the same time. – Antecedents 1/12, 2022 at 9:23

DataFrame.rank() is faster. – Hyperactive 5/1 at 15:0

107

Use advanced indexing on the left-hand side in the last step:

array = numpy.array([4,2,7,1])
temp = array.argsort()
ranks = numpy.empty_like(temp)
ranks[temp] = numpy.arange(len(array))

This avoids sorting twice by inverting the permutation in the last step.

Unisexual answered 12/3, 2011 at 19:1 Comment(6)

any recommendation on how to do this rowwise? – Sharpedged 19/11, 2017 at 14:9

For more than 1 dim see answer below. – Iddo 22/1, 2020 at 16:17

Just a minor correction, the indirection ranks[temp] isn't technically slicing, but indexing (see numpy.org/doc/stable/reference/arrays.indexing.html) – Cryotherapy 19/1, 2021 at 17:38

Looks a bit like the implementation of scipy.stats.rankdata (for ordinal rank). – Submerge 16/6, 2021 at 17:14

This method doesn't work when there is a tie in data – Teteak 8/11, 2022 at 19:10

Use argsort() twice will do it:

>>> array = [4,2,7,1]
>>> ranks = numpy.array(array).argsort().argsort()
>>> ranks
array([2, 1, 3, 0])

Charitycharivari answered 4/9, 2014 at 4:23 Comment(1)

this was already mentioned way before you posed your answer – Ronaronal 9/4, 2017 at 16:26

For a vectorized version of an averaged rank, see below. I love np.unique, it really widens the scope of what code can and cannot be efficiently vectorized. Aside from avoiding python for-loops, this approach also avoids the implicit double loop over 'a'.

import numpy as np

a = np.array( [4,1,6,8,4,1,6])

a = np.array([4,2,7,2,1])
rank = a.argsort().argsort()

unique, inverse = np.unique(a, return_inverse = True)

unique_rank_sum = np.zeros_like(unique)
np.add.at(unique_rank_sum, inverse, rank)
unique_count = np.zeros_like(unique)
np.add.at(unique_count, inverse, 1)

unique_rank_mean = unique_rank_sum.astype(np.float) / unique_count

rank_mean = unique_rank_mean[inverse]

print rank_mean

Gilreath answered 6/1, 2014 at 20:30 Comment(3)

by the way; I made this code to produce the same output as the other averaged rank code, but I can imagine the minimum rank of a group of repeating numbers works just as well. This can be obtained even more easily as >>> unique, index, inverse = np.unique(a, True, True) >>> rank_min = rank[index][inverse] – Gilreath 6/1, 2014 at 20:47

I am getting the following error with your solution(numpy 1.7.1): AttributeError: 'numpy.ufunc' object has no attribute 'at' – Anyways 14/7, 2017 at 9:33

This requires a more recent version of numpy; yours is quite ancient – Gilreath 14/7, 2017 at 10:8

I tried to extend both solution for arrays A of more than one dimension, supposing you process your array row-by-row (axis=1).

I extended the first code with a loop on rows; probably it can be improved

temp = A.argsort(axis=1)
rank = np.empty_like(temp)
rangeA = np.arange(temp.shape[1])
for iRow in xrange(temp.shape[0]): 
    rank[iRow, temp[iRow,:]] = rangeA

And the second one, following k.rooijers suggestion, becomes:

temp = A.argsort(axis=1)
rank = temp.argsort(axis=1)

I randomly generated 400 arrays with shape (1000,100); the first code took about 7.5, the second one 3.8.

Fencesitter answered 29/9, 2011 at 17:57 Comment(0)

Apart from the elegance and shortness of solutions, there is also the question of performance. Here is a little benchmark:

import numpy as np
from scipy.stats import rankdata
l = list(reversed(range(1000)))

%%timeit -n10000 -r5
x = (rankdata(l) - 1).astype(int)
>>> 128 µs ± 2.72 µs per loop (mean ± std. dev. of 5 runs, 10000 loops each)

%%timeit -n10000 -r5
a = np.array(l)
r = a.argsort().argsort()
>>> 69.1 µs ± 464 ns per loop (mean ± std. dev. of 5 runs, 10000 loops each)

%%timeit -n10000 -r5
a = np.array(l)
temp = a.argsort()
r = np.empty_like(temp)
r[temp] = np.arange(len(a))
>>> 63.7 µs ± 1.27 µs per loop (mean ± std. dev. of 5 runs, 10000 loops each)

Glynda answered 18/7, 2018 at 6:28 Comment(2)

Good idea, but for a fair comparison, you should use rankdata(l, method='ordinal') - 1. – Wrath 20/4, 2020 at 13:45

Also better to use a randomly generated array - l = random.choices(range(500),k=10000) - varying sample, size & iterations for fuller picture. Double slicing from @yupbank looks interesting – Uzzial 2/6, 2021 at 5:11

I tried the above methods, but failed because I had many zeores. Yes, even with floats duplicate items may be important.

So I wrote a modified 1D solution by adding a tie-checking step:

def ranks (v):
    import numpy as np
    t = np.argsort(v)
    r = np.empty(len(v),int)
    r[t] = np.arange(len(v))
    for i in xrange(1, len(r)):
        if v[t[i]] <= v[t[i-1]]: r[t[i]] = r[t[i-1]]
    return r

# test it
print sorted(zip(ranks(v), v))

I believe it's as efficient as it can be.

Alaska answered 6/7, 2012 at 9:33 Comment(0)

argsort and slice are symmetry operations.

try slice twice instead of argsort twice. since slice is faster than argsort

array = numpy.array([4,2,7,1])
order = array.argsort()
ranks = np.arange(array.shape[0])[order][order]

Mailable answered 30/10, 2019 at 12:1 Comment(2)

Wouldn't ranks = order[order] be equivalent? BTW, I do not think your code works in general. – Grandaunt 24/5, 2023 at 3:57

np.arange(array.shape[0])[order] does nothing except just evaluates to order and order[order] is totally wrong as the permutation is almost certainly not an involution. – Chuck 8/4 at 10:7

I liked the method by k.rooijers, but as rcoup wrote, repeated numbers are ranked according to array position. This was no good for me, so I modified the version to postprocess the ranks and merge any repeated numbers into a combined average rank:

import numpy as np
a = np.array([4,2,7,2,1])
r = np.array(a.argsort().argsort(), dtype=float)
f = a==a
for i in xrange(len(a)):
   if not f[i]: continue
   s = a == a[i]
   ls = np.sum(s)
   if ls > 1:
      tr = np.sum(r[s])
      r[s] = float(tr)/ls
   f[s] = False

print r  # array([ 3. ,  1.5,  4. ,  1.5,  0. ])

I hope this might help others too, I tried to find anothers solution to this, but couldn't find any...

Hypostasis answered 8/12, 2013 at 16:47 Comment(0)

More general version of one of the answers:

In [140]: x = np.random.randn(10, 3)

In [141]: i = np.argsort(x, axis=0)

In [142]: ranks = np.empty_like(i)

In [143]: np.put_along_axis(ranks, i, np.repeat(np.arange(x.shape[0])[:,None], x.shape[1], axis=1), axis=0)

See How to use numpy.argsort() as indices in more than 2 dimensions? to generalize to more dims.

Iddo answered 22/1, 2020 at 16:16 Comment(0)

Hot tags

Godot Unity Godot Help Programming Godot 4.X GUI GDScript 3D 2D Physics CSharp Godot 3.X VR XR Projects C++

Recommended topics

Hot tags