Pandas quantile failing with NaN's present
Asked Answered
P

1

8

I've encountered an interesting situation while calculating the inter-quartile range. Assuming we have a dataframe such as:

import pandas as pd
index=pd.date_range('2014 01 01',periods=10,freq='D')
data=pd.np.random.randint(0,100,(10,5))
data = pd.DataFrame(index=index,data=data)

data
Out[90]: 
             0   1   2   3   4
2014-01-01  33  31  82   3  26
2014-01-02  46  59   0  34  48
2014-01-03  71   2  56  67  54
2014-01-04  90  18  71  12   2
2014-01-05  71  53   5  56  65
2014-01-06  42  78  34  54  40
2014-01-07  80   5  76  12  90
2014-01-08  60  90  84  55  78
2014-01-09  33  11  66  90   8
2014-01-10  40   8  35  36  98

# test for q1 values (this works)
data.quantile(0.25)
Out[111]: 
0    40.50
1     8.75
2    34.25
3    17.50
4    29.50

# break it by inserting row of nans
data.iloc[-1] = pd.np.NaN

data.quantile(0.25)
Out[115]: 
0    42
1    11
2    34
3    12
4    26

The first quartile can be calculated by taking the median of values in the dataframe that fall below the overall median, so we can see what data.quantile(0.25) should have yielded. e.g.

med = data.median()
q1  = data[data<med].median()
q1
Out[119]: 
0    37.5
1     8.0
2    19.5
3    12.0
4    17.0

It seems that quantile is failing to provide an appropriate representation of q1 etc. since it is not doing a good job of handling the NaN values (i.e. it works without NaNs, but not with NaNs).

I thought this may not be a "NaN" issue, rather it might be quantile failing to handle even-numbered data sets (i.e. where the median must be calculated as the mean of the two central numbers). However, after testing with dataframes with both even and odd-numbers of rows I saw that quantile handled these situations properly. The problem seems to arise only when NaN values are present in the dataframe.

I would like to use quntile to calculate the rolling q1/q3 values in my dataframe, however, this will not work with NaN's present. Can anyone provide a solution to this issue?

Peria answered 4/6, 2014 at 19:48 Comment(5)
did you try df.dropna()?Order
I have applied dropna(how='all'), which makes my example misleading. However, I am not willing to apply dropna(how='any'), since I do not want to lose valid data just because a NaN sits in the next column over. Your suggestion is good. That being said, there still seems to be a fundamental problem with quantile (or so I think!).Peria
sorry bud, good luck!Order
Are you sure that result should be data[data<med].median() and not data[data<=med].median()?Jerome
And this (above comment of @alko) gives indeed the same result as pandas quantileSilversmith
P
4

Internally, quantile uses numpy.percentile over the non-null values. When you change the last row of data to NaNs you're essentially left with an array array([ 33., 46., 71., 90., 71., 42., 80., 60., 33.]) in the first column

Calculating np.percentile(array([ 33., 46., 71., 90., 71., 42., 80., 60., 33.]) gives 42.

From the docstring:

Given a vector V of length N, the qth percentile of V is the qth ranked value in a sorted copy of V. A weighted average of the two nearest neighbors is used if the normalized ranking does not match q exactly. The same as the median if q=50, the same as the minimum if q=0 and the same as the maximum if q=100.

Planetesimal answered 4/6, 2014 at 21:30 Comment(2)
And to add, if you use np.percentile directly (with the NaNs), eg with df2.apply(np.percentile, args=[25]) will you give other values, as the NaNs are not removed before calculating the quantile as with pandas.Silversmith
@Planetesimal the code np.percentile(np.array([ 33., 46., 71., 90., 71., 42., 80., 60., 33.], 0.25) returns 32.999, not 42. I think using np.quantile() can return 42, the code np.quantile(np.array([ 33., 46., 71., 90., 71., 42., 80., 60., 33.], 0.25) return 42 .Lucais

© 2022 - 2024 — McMap. All rights reserved.