Pandas ignores dropna=False with Categorical columns in groupby()

I want to include NA values when using groupby(), which does not happen by default. I thought the option dropna=False would make that happen, but when the column is of type Categorical the option has no effect.

I assume there is a well-thought-out design decision behind this. Or maybe it is related to this pandas bug, which I do not fully understand?

The pandas version I use here is 1.2.5.

#!/usr/bin/env python3
import pandas as pd

print(pd.__version__)  # 1.2.5

# initial data
df = pd.DataFrame(
    {
        '2019': [1, pd.NA, 0],
        'N': [2, 0, 7],
    }
)
print(df)

## groupby()'s working as expected

# without NA
res = df.groupby('2019').size()
print(f'\n{res}')

# include NA
res = df.groupby('2019', dropna=False).size()
print(f'\n{res}')

## now the problems
## convert to Category
df['2019'] = df['2019'].astype('category')

# PROBLEM: NA is ignored
res = df.groupby('2019', dropna=False).size()
print(f'\n{res}')

# PROBLEM: NA is still ignored; observed=True has no effect either
res = df.groupby('2019', dropna=False, observed=True).size()
print(f'\n{res}')

In the output you see the initial DataFrame first, followed by two groupby() outputs that behave as expected. The last two groupby() outputs illustrate my problem.

1.2.5
   2019  N
0     1  2
1  <NA>  0
2     0  7

2019
0    1
1    1
dtype: int64

2019
0.0    1
1.0    1
NaN    1
dtype: int64

2019
0    1
1    1
dtype: int64

2019
1    1
0    1
dtype: int64
Autostability answered 2/11, 2021 at 12:58 Comment(2)
I'd say it's more like a bug. My guess is that the problem is that df['2019'].dtype gives CategoricalDtype(categories=[0, 1], ordered=False) and groupby only searches over the categories. – Official
This is a bug acknowledged by the pandas maintainers; see github.com/pandas-dev/pandas/issues/36327. They're looking for PRs to fix this if anyone is game! – Weinert
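
The guess in the first comment can be checked by looking at how the categorical column stores its values. The sketch below only shows that NA is not one of the categories; it does not prove that this is the exact cause inside groupby().

```python
import pandas as pd

# Same values as the '2019' column in the question.
s = pd.Series([1, pd.NA, 0]).astype('category')

# Only the non-missing values become categories.
print(list(s.cat.categories))  # [0, 1]

# Missing values have no category; they are stored with the sentinel code -1.
print(s.cat.codes.tolist())    # [1, -1, 0]
```

So any code path that iterates over the categories alone never sees the NA row.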

This is a bug. It has been fixed, and the fix will be released in pandas 2.0.

The simplest workaround is to temporarily cast the column from categorical back to the dtype of its categories:

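import numpy as np

# dtype of the underlying categories (e.g. int64)
orig = df['2019'].cat.categories.dtype
if np.issubdtype(orig, np.integer) or orig == 'bool':
    orig = 'Int64'  # Nullable integer dtype, so NA values survive the cast.
# Cast back for the groupby only; df itself is left unchanged.
res = df.astype({'2019': orig}).groupby('2019', dropna=False, observed=True).size()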
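
For reference, here is the same workaround as a self-contained sketch using the data from the question. The helper name size_with_na is made up for illustration (it is not a pandas API), and the sketch assumes the categories have a plain NumPy dtype.

```python
import numpy as np
import pandas as pd


def size_with_na(df, col):
    """Group sizes for the categorical column `col`, keeping the NA group.

    Hypothetical helper (not part of pandas): casts the column back to the
    dtype of its categories just for the groupby call.
    """
    target = df[col].cat.categories.dtype
    if np.issubdtype(target, np.integer) or target == 'bool':
        target = 'Int64'  # nullable integer dtype, so NA survives the cast
    return df.astype({col: target}).groupby(col, dropna=False, observed=True).size()


df = pd.DataFrame({'2019': [1, pd.NA, 0], 'N': [2, 0, 7]})
df['2019'] = df['2019'].astype('category')

res = size_with_na(df, '2019')
print(res)  # three groups (0, 1 and NA), one row each
```

The original df keeps its categorical column, because astype() returns a copy.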
Scheffler answered 30/1, 2023 at 14:24 Comment(0)
