Rank items in an array using Python/NumPy, without sorting array twice
Asked Answered
S

11

160

I have an array of numbers and I'd like to create another array that represents the rank of each item in the first array. I'm using Python and NumPy.

For example:

array = [4,2,7,1]
ranks = [2,1,3,0]

Here's the best method I've come up with:

array = numpy.array([4,2,7,1])
temp = array.argsort()
ranks = numpy.arange(len(array))[temp.argsort()]

Are there any better/faster methods that avoid sorting the array twice?

Syringa answered 12/3, 2011 at 18:52 Comment(2)
Your last line is equivalent to ranks = temp.argsort().Unisexual
This method doesn't work when there is a tie in dataTeteak
U
107

Use advanced indexing on the left-hand side in the last step:

array = numpy.array([4,2,7,1])
temp = array.argsort()
ranks = numpy.empty_like(temp)
ranks[temp] = numpy.arange(len(array))

This avoids sorting twice by inverting the permutation in the last step.

Unisexual answered 12/3, 2011 at 19:1 Comment(6)
Perfect, thank you! I knew there was a solution and it would seem obvious once I saw it. I did some testing with timeit, and this method is slightly slower for small arrays. On my machine they're equal when the array has 2,000 elements. At 20,000 elements, your method is about 25% faster.Syringa
any recommendation on how to do this rowwise?Sharpedged
For more than 1 dim see answer below.Iddo
Just a minor correction, the indirection ranks[temp] isn't technically slicing, but indexing (see numpy.org/doc/stable/reference/arrays.indexing.html)Cryotherapy
Looks a bit like the implementation of scipy.stats.rankdata (for ordinal rank).Submerge
This method doesn't work when there is a tie in dataTeteak
S
145

Use argsort twice, first to obtain the order of the array, then to obtain ranking:

array = numpy.array([4,2,7,1])
order = array.argsort()
ranks = order.argsort()

When dealing with 2D (or higher dimensional) arrays, be sure to pass an axis argument to argsort to order over the correct axis.

Specialist answered 7/6, 2011 at 14:12 Comment(10)
Note that if numbers are repeated in your input array (eg. [4,2,7,1,1]) the output will rank those numbers based on their array position ([3,2,4,0,1])Revert
Sorting twice is inefficient. @Sven Marnach's answer shows how to accomplish the ranking with a single call to argsort.Wrath
@WarrenWeckesser: I just tested the difference between the two, and you're right for large arrays, but for anything smaller (n < 100), double argsort is faster (about 20% faster for n=100, and about 5 times faster for n=10). So if you have to do a lot of rankings across lots of small sets of values, this method is much better.Desex
That's a good point--I see roughly the same effect. For large enough input, the improved efficiency wins, but for small inputs, the double argsort apparently has less overhead.Wrath
@WarrenWeckesser: Actually, I'm wrong, this method is hands-down better. Both methods are far faster than the scipy.stats method, too. Results: gist.github.com/naught101/14042d91a2d0f18a6ae4Desex
@naught101: There is a bug in your script. The line array = np.random.rand(10) should be array = np.random.rand(n).Wrath
Wouldn't the following also be equivalent without having to sort twice? array = numpy.array([4,2,7,1]); order = array.argsort(); ranks = order[order]Boride
I couldn't find an explanation as to why double argsort works - I hope this helps berkayantmen.com/rankOrbadiah
Doesn't work on an input with repeated elements.Inquietude
This method doesn't work when there is a tie in dataTeteak
W
132

This question is a few years old, and the accepted answer is great, but I think the following is still worth mentioning. If you don't mind the dependence on scipy, you can use scipy.stats.rankdata:

In [22]: from scipy.stats import rankdata

In [23]: a = [4, 2, 7, 1]

In [24]: rankdata(a)
Out[24]: array([ 3.,  2.,  4.,  1.])

In [25]: (rankdata(a) - 1).astype(int)
Out[25]: array([2, 1, 3, 0])

A nice feature of rankdata is that the method argument provides several options for handling ties. For example, there are three occurrences of 20 and two occurrences of 40 in b:

In [26]: b = [40, 20, 70, 10, 20, 50, 30, 40, 20]

The default assigns the average rank to the tied values:

In [27]: rankdata(b)
Out[27]: array([ 6.5,  3. ,  9. ,  1. ,  3. ,  8. ,  5. ,  6.5,  3. ])

method='ordinal' assigns consecutive ranks:

In [28]: rankdata(b, method='ordinal')
Out[28]: array([6, 2, 9, 1, 3, 8, 5, 7, 4])

method='min' assigns the minimum rank of the tied values to all the tied values:

In [29]: rankdata(b, method='min')
Out[29]: array([6, 2, 9, 1, 2, 8, 5, 6, 2])

See the docstring for more options.

Wrath answered 15/3, 2015 at 11:15 Comment(4)
yeah, this is the best answer anywhere where edge cases are important.Desex
I find it interesting that rankdata seems to use the same mechanism as the accepted answer to generate the initial ranking internally.Iqbal
This is a great answer, it also works nice for my pandas DataFrames where I want to rank elements in each row of a 2D DataFrame but specifying axis=0 in rankdata() and handle ties at the same time.Antecedents
DataFrame.rank() is faster.Hyperactive
U
107

Use advanced indexing on the left-hand side in the last step:

array = numpy.array([4,2,7,1])
temp = array.argsort()
ranks = numpy.empty_like(temp)
ranks[temp] = numpy.arange(len(array))

This avoids sorting twice by inverting the permutation in the last step.

Unisexual answered 12/3, 2011 at 19:1 Comment(6)
Perfect, thank you! I knew there was a solution and it would seem obvious once I saw it. I did some testing with timeit, and this method is slightly slower for small arrays. On my machine they're equal when the array has 2,000 elements. At 20,000 elements, your method is about 25% faster.Syringa
any recommendation on how to do this rowwise?Sharpedged
For more than 1 dim see answer below.Iddo
Just a minor correction, the indirection ranks[temp] isn't technically slicing, but indexing (see numpy.org/doc/stable/reference/arrays.indexing.html)Cryotherapy
Looks a bit like the implementation of scipy.stats.rankdata (for ordinal rank).Submerge
This method doesn't work when there is a tie in dataTeteak
C
8

Use argsort() twice will do it:

>>> array = [4,2,7,1]
>>> ranks = numpy.array(array).argsort().argsort()
>>> ranks
array([2, 1, 3, 0])
Charitycharivari answered 4/9, 2014 at 4:23 Comment(1)
this was already mentioned way before you posed your answerRonaronal
G
6

For a vectorized version of an averaged rank, see below. I love np.unique, it really widens the scope of what code can and cannot be efficiently vectorized. Aside from avoiding python for-loops, this approach also avoids the implicit double loop over 'a'.

import numpy as np

a = np.array( [4,1,6,8,4,1,6])

a = np.array([4,2,7,2,1])
rank = a.argsort().argsort()

unique, inverse = np.unique(a, return_inverse = True)

unique_rank_sum = np.zeros_like(unique)
np.add.at(unique_rank_sum, inverse, rank)
unique_count = np.zeros_like(unique)
np.add.at(unique_count, inverse, 1)

unique_rank_mean = unique_rank_sum.astype(np.float) / unique_count

rank_mean = unique_rank_mean[inverse]

print rank_mean
Gilreath answered 6/1, 2014 at 20:30 Comment(3)
by the way; I made this code to produce the same output as the other averaged rank code, but I can imagine the minimum rank of a group of repeating numbers works just as well. This can be obtained even more easily as >>> unique, index, inverse = np.unique(a, True, True) >>> rank_min = rank[index][inverse]Gilreath
I am getting the following error with your solution(numpy 1.7.1): AttributeError: 'numpy.ufunc' object has no attribute 'at'Anyways
This requires a more recent version of numpy; yours is quite ancientGilreath
F
5

I tried to extend both solution for arrays A of more than one dimension, supposing you process your array row-by-row (axis=1).

I extended the first code with a loop on rows; probably it can be improved

temp = A.argsort(axis=1)
rank = np.empty_like(temp)
rangeA = np.arange(temp.shape[1])
for iRow in xrange(temp.shape[0]): 
    rank[iRow, temp[iRow,:]] = rangeA

And the second one, following k.rooijers suggestion, becomes:

temp = A.argsort(axis=1)
rank = temp.argsort(axis=1)

I randomly generated 400 arrays with shape (1000,100); the first code took about 7.5, the second one 3.8.

Fencesitter answered 29/9, 2011 at 17:57 Comment(0)
G
4

Apart from the elegance and shortness of solutions, there is also the question of performance. Here is a little benchmark:

import numpy as np
from scipy.stats import rankdata
l = list(reversed(range(1000)))

%%timeit -n10000 -r5
x = (rankdata(l) - 1).astype(int)
>>> 128 µs ± 2.72 µs per loop (mean ± std. dev. of 5 runs, 10000 loops each)

%%timeit -n10000 -r5
a = np.array(l)
r = a.argsort().argsort()
>>> 69.1 µs ± 464 ns per loop (mean ± std. dev. of 5 runs, 10000 loops each)

%%timeit -n10000 -r5
a = np.array(l)
temp = a.argsort()
r = np.empty_like(temp)
r[temp] = np.arange(len(a))
>>> 63.7 µs ± 1.27 µs per loop (mean ± std. dev. of 5 runs, 10000 loops each)
Glynda answered 18/7, 2018 at 6:28 Comment(2)
Good idea, but for a fair comparison, you should use rankdata(l, method='ordinal') - 1.Wrath
Also better to use a randomly generated array - l = random.choices(range(500),k=10000) - varying sample, size & iterations for fuller picture. Double slicing from @yupbank looks interestingUzzial
A
2

I tried the above methods, but failed because I had many zeores. Yes, even with floats duplicate items may be important.

So I wrote a modified 1D solution by adding a tie-checking step:

def ranks (v):
    import numpy as np
    t = np.argsort(v)
    r = np.empty(len(v),int)
    r[t] = np.arange(len(v))
    for i in xrange(1, len(r)):
        if v[t[i]] <= v[t[i-1]]: r[t[i]] = r[t[i-1]]
    return r

# test it
print sorted(zip(ranks(v), v))

I believe it's as efficient as it can be.

Alaska answered 6/7, 2012 at 9:33 Comment(0)
M
1

argsort and slice are symmetry operations.

try slice twice instead of argsort twice. since slice is faster than argsort

array = numpy.array([4,2,7,1])
order = array.argsort()
ranks = np.arange(array.shape[0])[order][order]
Mailable answered 30/10, 2019 at 12:1 Comment(2)
Wouldn't ranks = order[order] be equivalent? BTW, I do not think your code works in general.Grandaunt
np.arange(array.shape[0])[order] does nothing except just evaluates to order and order[order] is totally wrong as the permutation is almost certainly not an involution.Chuck
H
0

I liked the method by k.rooijers, but as rcoup wrote, repeated numbers are ranked according to array position. This was no good for me, so I modified the version to postprocess the ranks and merge any repeated numbers into a combined average rank:

import numpy as np
a = np.array([4,2,7,2,1])
r = np.array(a.argsort().argsort(), dtype=float)
f = a==a
for i in xrange(len(a)):
   if not f[i]: continue
   s = a == a[i]
   ls = np.sum(s)
   if ls > 1:
      tr = np.sum(r[s])
      r[s] = float(tr)/ls
   f[s] = False

print r  # array([ 3. ,  1.5,  4. ,  1.5,  0. ])

I hope this might help others too, I tried to find anothers solution to this, but couldn't find any...

Hypostasis answered 8/12, 2013 at 16:47 Comment(0)
I
0

More general version of one of the answers:

In [140]: x = np.random.randn(10, 3)

In [141]: i = np.argsort(x, axis=0)

In [142]: ranks = np.empty_like(i)

In [143]: np.put_along_axis(ranks, i, np.repeat(np.arange(x.shape[0])[:,None], x.shape[1], axis=1), axis=0)

See How to use numpy.argsort() as indices in more than 2 dimensions? to generalize to more dims.

Iddo answered 22/1, 2020 at 16:16 Comment(0)

© 2022 - 2024 — McMap. All rights reserved.