Calculate percentile of value in column

L

5

23

I have a dataframe with a column that has numerical values. This column is not well-approximated by a normal distribution. Given another numerical value, not in this column, how can I calculate its percentile in the column? That is, if the value is greater than 80% of the values in the column but less than the other 20%, it would be in the 20th percentile.

Levileviable answered 29/6, 2017 at 12:18 Comment(0)

V

1

Sort the column, and see if the value is in the first 20% or whatever percentile.

for example:

def in_percentile(my_series, val, perc=0.2): 
    myList=sorted(my_series.values.tolist())
    l=len(myList)
    return val>myList[int(l*perc)]

Or, if you want the actual percentile simply use searchsorted:

my_series.values.searchsorted(val)/len(my_series)*100

Ventriculus answered 29/6, 2017 at 12:46 Comment(2)

This way I have to iterate over all possible percentiles to find out which percentile the new value is in. – Levileviable 29/6, 2017 at 12:54

Please be warned that using searchsorted() will not provide accurate results. I'm getting "100" for a value that is in the 94.5% percentile. – Mondrian 31/7, 2023 at 18:14

P

44

To find the percentile of a value relative to an array (or in your case a dataframe column), use the scipy function stats.percentileofscore().

For example, if we have a value x (the other numerical value not in the dataframe), and a reference array, arr (the column from the dataframe), we can find the percentile of x by:

from scipy import stats
percentile = stats.percentileofscore(arr, x)

Note that there is a third parameter to the stats.percentileofscore() function that has a significant impact on the resulting value of the percentile, viz. kind. You can choose from rank, weak, strict, and mean. See the docs for more information.

For an example of the difference:

>>> df
   a
0  1
1  2
2  3
3  4
4  5

>>> stats.percentileofscore(df['a'], 4, kind='rank')
80.0

>>> stats.percentileofscore(df['a'], 4, kind='weak')
80.0

>>> stats.percentileofscore(df['a'], 4, kind='strict')
60.0

>>> stats.percentileofscore(df['a'], 4, kind='mean')
70.0

As a final note, if you have a value that is greater than 80% of the other values in the column, it would be in the 80th percentile (see the example above for how the kind method affects this final score somewhat) not the 20th percentile. See this Wikipedia article for more information.

Parodic answered 19/1, 2018 at 6:42 Comment(0)

M

5

Probably very late but still

df['column_name'].describe()

will give you the regular 25, 50 and 75 percentile with some additional data but if you want percentiles for some specific values then

df['column_name'].describe(percentiles=[0.1, 0.2, 0.3, 0.5])

This will give you 10th, 20th, 30th and 50th percentiles. You can give as many values as you want.

The resulting object can be accessed like a dict:

desc = df['column_name'].describe(percentiles=[0.1, 0.2, 0.3, 0.5])
print(desc)
print(desc['10%'])

Majka answered 4/11, 2019 at 6:17 Comment(0)

M

3

Since you're looking for values over/under a specific threshold, you could consider using pandas qcut function. If you wanted values under 20% and over 80%, divide your data into 5 equal sized partitions. Each partition would represent a 20% "chunk" of equal size (five 20% partitions is 100%). So, given a DataFrame with 1 column 'a' which represents the column you have data for:

df['newcol'] = pd.qcut(df['a'], 5, labels=False)

This will give you a new column to your DataFrame with each row having a value in (0, 1, 2, 3, 4). Where 0 represents your lowest 20% and 4 represents your highest 20% which is the 80% percentile.

Mcafee answered 4/6, 2019 at 21:41 Comment(0)

V

1

Sort the column, and see if the value is in the first 20% or whatever percentile.

for example:

def in_percentile(my_series, val, perc=0.2): 
    myList=sorted(my_series.values.tolist())
    l=len(myList)
    return val>myList[int(l*perc)]

Or, if you want the actual percentile simply use searchsorted:

my_series.values.searchsorted(val)/len(my_series)*100

Ventriculus answered 29/6, 2017 at 12:46 Comment(2)

This way I have to iterate over all possible percentiles to find out which percentile the new value is in. – Levileviable 29/6, 2017 at 12:54

Please be warned that using searchsorted() will not provide accurate results. I'm getting "100" for a value that is in the 94.5% percentile. – Mondrian 31/7, 2023 at 18:14

I

0

The answer is to divide percentile by 100 and call it quantile:

result = my_series.quantile(0.8)

Inearth answered 24/3 at 14:13 Comment(0)

Recommended topics

Hot tags