Confusing behaviour of Pandas crosstab() function with dataframe containing NaN values
Asked Answered
W

3

21

I'm using Python 3.4.1 with numpy 0.10.1 and pandas 0.17.0. I have a large dataframe that lists species and gender of individual animals. It's a real-world dataset and there are, inevitably, missing values represented by NaN. A simplified version of the data can be generated as:

import numpy as np
import pandas as pd
tempDF = pd.DataFrame({ 'id': [1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20],
                        'species': ["dog","dog",np.nan,"dog","dog","cat","cat","cat","dog","cat","cat","dog","dog","dog","dog",np.nan,"cat","cat","dog","dog"],
                        'gender': ["male","female","female","male","male","female","female",np.nan,"male","male","female","male","female","female","male","female","male","female",np.nan,"male"]})

Printing the dataframe gives:

    gender  id species
0     male   1     dog
1   female   2     dog
2   female   3     NaN
3     male   4     dog
4     male   5     dog
5   female   6     cat
6   female   7     cat
7      NaN   8     cat
8     male   9     dog
9     male  10     cat
10  female  11     cat
11    male  12     dog
12  female  13     dog
13  female  14     dog
14    male  15     dog
15  female  16     NaN
16    male  17     cat
17  female  18     cat
18     NaN  19     dog
19    male  20     dog

I want to generate a cross-tabulated table to show number of males and females in each species using the following:

pd.crosstab(tempDF['species'],tempDF['gender'])

This produces the following table:

gender   female  male
species              
cat           4     2
dog           3     7

Which is what I'd expect. However, if I include the margins=True option, it produces:

pd.crosstab(tempDF['species'],tempDF['gender'],margins=True)

gender   female  male  All
species                   
cat           4     2    7
dog           3     7   11
All           9     9   20

As you can see, the marginal totals appear to be incorrect, presumably caused by the missing data in the dataframe. Is this intended behaviour? In my mind, it seems very confusing. Surely marginal totals should be totals of rows and columns as they appear in the table and not include any missing data that isn't represented in the table. Including dropna=False does not affect the outcome.

I can delete any row with a NaN before creating the table but that seems to be a lot of extra work and a lot of extra things to think about when doing an analysis. Should I report this as a bug?

Witling answered 23/10, 2015 at 13:18 Comment(2)
Maybe create second dataframe with df.dropna(), then call crosstab on this new dataframe?Greyhen
I agree that would be an option but it adds an extra layer of complexity to, what should be, a very simple process. And (if there were lots of other variables in the dataframe, each with NaNs) it could mean generating lots of new dataframes for each cross-table you wanted to produce.Witling
W
24

I suppose one workaround would be to convert the NaNs to 'missing' before creating the table and then the cross-tubulation will include columns and rows specifically for missing values:

pd.crosstab(tempDF['species'].fillna('missing'),tempDF['gender'].fillna('missing'),margins=True)

gender   female  male  missing  All
species                            
cat           4     2        1    7
dog           3     7        1   11
missing       2     0        0    2
All           9     9        2   20

Personally, I would like to see that the default behaviour so I wouldn't have to remember to replace all the NaNs in every crosstab calculation.

Witling answered 23/10, 2015 at 13:59 Comment(0)
S
6

You're not the only one experiencing this. It not only happens with pd.crosstab, but also pd.pivot_table and DataFrame.groupby

In the docs it says this about groupby excluding Na's:

NA groups in GroupBy are automatically excluded. This behavior is consistent with R, for example.

You can find some good solutions in this post: groupby columns with NaN (missing) values

Maybe one day someone will solve this issue: https://github.com/pandas-dev/pandas/issues/10772

Serviceable answered 20/4, 2017 at 9:7 Comment(0)
D
0

You can set dropna=True and then the totals won't include the missing data. But if you did want to include the missings then the fillna option is best

Deceased answered 12/4, 2022 at 10:18 Comment(0)

© 2022 - 2024 — McMap. All rights reserved.