I want to include NA values when using groupby(), which does not happen by default. I think the option dropna=False makes it happen. But when the column is of type Categorical, the option has no effect.
I assume the best, namely that there is a well-thought-out design decision behind that. Or maybe it is related to this pandas bug, which I do not fully understand?
The pandas version I use here is 1.2.5.
#!/usr/bin/env python3
import pandas as pd
print(pd.__version__) # 1.2.5
# initial data
df = pd.DataFrame(
{
'2019': [1, pd.NA, 0],
'N': [2, 0, 7],
}
)
print(df)
## groupby() works as expected
# without NA
res = df.groupby('2019').size()
print(f'\n{res}')
# include NA
res = df.groupby('2019', dropna=False).size()
print(f'\n{res}')
## now the problems
## convert to Category
df['2019'] = df['2019'].astype('category')
# PROBLEM: NA is ignored
res = df.groupby('2019', dropna=False).size()
print(f'\n{res}')
# PROBLEM: NA is still ignored; observed=True has no effect either
res = df.groupby('2019', dropna=False, observed=True).size()
print(f'\n{res}')
In the output you see the initial DataFrame first and then two groupby() outputs that behave as expected. The last two groupby() outputs illustrate my problem.
1.2.5
2019 N
0 1 2
1 <NA> 0
2 0 7
2019
0 1
1 1
dtype: int64
2019
0.0 1
1.0 1
NaN 1
dtype: int64
2019
0 1
1 1
dtype: int64
2019
1 1
0 1
dtype: int64
>>> df['2019'].dtype gives CategoricalDtype(categories=[0, 1], ordered=False), and groupby only searches over categories. – Official
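The point that groupby only looks at the categories can be checked directly: after the conversion, NA is not one of the categories but is stored with the code -1. The sketch below shows this and then one possible workaround, which is my own assumption rather than anything recommended in the post: add an explicit placeholder category (the label 'missing' is a hypothetical name) and fill the NA values with it before grouping.

```python
import pandas as pd

# Same data as in the question
df = pd.DataFrame(
    {
        '2019': [1, pd.NA, 0],
        'N': [2, 0, 7],
    }
)

# Convert to Category, as in the question
s = df['2019'].astype('category')

# NA is not a category; it is stored with the code -1
print(s.cat.categories)
print(s.cat.codes)

# Possible workaround (an assumption, not from the original post):
# make the missing values an explicit category before grouping.
# 'missing' is a hypothetical placeholder label.
filled = s.cat.add_categories(['missing']).fillna('missing')

res = df.groupby(filled, observed=True).size()
print(f'\n{res}')
```

With the placeholder in place, the formerly missing row is counted in its own group, independent of how dropna is handled for Categorical columns in a given pandas version.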