Does the quantile() function in Pandas ignore NaN?

I have a DataFrame, dfAB:

import numpy as np
import pandas as pd
import random

A = [ random.randint(0,100) for i in range(10) ]
B = [ random.randint(0,100) for i in range(10) ]

dfAB = pd.DataFrame({ 'A': A, 'B': B })
dfAB

I want to know the 75th percentile of each column, so I use the quantile function:

dfAB.quantile(0.75)

But say I now put some NaNs in dfAB and re-run the function. Obviously the result is different:

dfAB.loc[5:8]=np.nan
dfAB.quantile(0.75)

When I calculated the mean of dfAB, I passed skipna=True to ignore the NaNs, because I didn't want them affecting my stats (I have quite a few in my data, on purpose, and obviously setting them to zero doesn't help):

dfAB.mean(skipna=True)

So what I'm getting at is: does the quantile function handle NaNs, and if so, how?

Electrodynamics answered 4/9, 2018 at 17:27 Comment(5)
Well, if you pass skipna=True, I guess it skips them. – Ceuta
If you pass skipna=False to mean and the column has NaN, it will return NaN. – Rainstorm
Don't ask us; we're biological units. Try it and see what happens. Load a df with half NaN values and play around for a few minutes. – Oddfellow
Side comment on the way you generate A and B: you can just use A = np.random.randint(100, size=10). – Cassicassia
The docs didn't have a reference to skipna for the quantile function, which is why I asked: DataFrame.quantile(q=0.5, axis=0, numeric_only=True, interpolation='linear'). @sacul kindly highlighted the correct comparator, np.nanpercentile, which I didn't know existed. Thanks all. – Electrodynamics

Yes, this appears to be how DataFrame.quantile deals with NaN values. To illustrate, you can compare the results to np.nanpercentile, which explicitly "computes the qth percentile of the data along the specified axis, while ignoring nan values" (quoted from the docs):

>>> dfAB
      A     B
0   5.0  10.0
1  43.0  67.0
2  86.0   2.0
3  61.0  83.0
4   2.0  27.0
5   NaN   NaN
6   NaN   NaN
7   NaN   NaN
8   NaN   NaN
9  27.0  70.0

>>> dfAB.quantile(0.75)
A    56.50
B    69.25
Name: 0.75, dtype: float64

>>> np.nanpercentile(dfAB, 75, axis=0)
array([56.5 , 69.25])

The two results are identical.
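As a further check, here is a minimal sketch (not part of the original answer) that rebuilds the frame shown above and compares three approaches: quantile on the frame with NaNs, quantile after explicitly dropping the NaN rows, and np.nanpercentile. The variable names are only illustrative. It also shows, for contrast, that plain np.percentile does not skip NaNs.

```python
import numpy as np
import pandas as pd

nan = np.nan
# Same values as the dfAB printed above (rows 5-8 are NaN).
dfAB = pd.DataFrame({
    'A': [5, 43, 86, 61, 2, nan, nan, nan, nan, 27],
    'B': [10, 67, 2, 83, 27, nan, nan, nan, nan, 70],
})

# Quantile computed directly on the frame that still contains NaNs.
q_with_nan = dfAB.quantile(0.75)

# Quantile computed after removing the NaN rows by hand.
q_dropped = dfAB.dropna().quantile(0.75)

# NumPy's NaN-aware percentile, as used in the answer.
q_numpy = np.nanpercentile(dfAB.to_numpy(), 75, axis=0)

# For contrast: the non-NaN-aware version propagates NaN.
q_plain = np.percentile(dfAB.to_numpy(), 75, axis=0)

print(q_with_nan.to_numpy(), q_dropped.to_numpy(), q_numpy, q_plain)
```

If the first three results agree while the last one is all NaN, that supports the conclusion that quantile skips NaN values rather than propagating them.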

Montane answered 4/9, 2018 at 17:38 Comment(1)
For Pandas v2.0 and up the default for numeric_only is False. See docs. I expect this will change the output of the answer here. – Vogler

Yes. DataFrame.quantile() ignores NaN values when calculating the quantile.

To show this, we can compare it with np.nanquantile, which computes the qth quantile of the data along the specified axis while ignoring NaN values [source].

>>> import random
>>> import numpy as np
>>> import pandas as pd
>>> random.seed(7)
>>> A = [ random.randint(0,100) for i in range(10) ]
>>> B = [ random.randint(0,100) for i in range(10) ]
>>> dfAB = pd.DataFrame({'A': A, 'B': B})
>>> dfAB.loc[5:8]=np.nan

>>> dfAB
      A     B
0  41.0   7.0
1  19.0  64.0
2  50.0  27.0
3  83.0   4.0
4   6.0  11.0
5   NaN   NaN
6   NaN   NaN
7   NaN   NaN
8   NaN   NaN
9  74.0  11.0

>>> dfAB.quantile(0.75)
A    68.0
B    23.0
Name: 0.75, dtype: float64

>>> np.nanquantile(dfAB, 0.75, axis=0)
array([68., 23.])
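
To see where the 68.0 for column A comes from, the sketch below reproduces it by hand. It assumes the default interpolation='linear' and uses a hypothetical helper, quantile_ignoring_nan, that is not part of pandas or NumPy: it discards the NaNs, sorts the remaining values, and interpolates linearly between the two neighbouring order statistics. It does not handle an input with no non-NaN values.

```python
import math
import pandas as pd

def quantile_ignoring_nan(values, q):
    """Hypothetical helper: linear-interpolation quantile over non-NaN values."""
    data = sorted(v for v in values if not math.isnan(v))
    # Fractional position of the quantile within the sorted data.
    pos = q * (len(data) - 1)
    lo, hi = math.floor(pos), math.ceil(pos)
    # Interpolate between the two surrounding order statistics.
    return data[lo] + (pos - lo) * (data[hi] - data[lo])

nan = float('nan')
# Column A from the frame above.
col_a = [41, 19, 50, 83, 6, nan, nan, nan, nan, 74]

manual = quantile_ignoring_nan(col_a, 0.75)
via_pandas = pd.Series(col_a).quantile(0.75)
print(manual, via_pandas)
```

With six non-NaN values, the 0.75 quantile sits at position 0.75 * 5 = 3.75 in the sorted list [6, 19, 41, 50, 74, 83], so the result is 50 + 0.75 * (74 - 50) = 68.0. The position is computed from the six remaining values, not from all ten rows.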
Fitment answered 17/11, 2021 at 10:35 Comment(0)
