Pandas groupby with categories: redundant NaN rows
I am having issues using pandas groupby with categorical data. Theoretically, it should be super efficient: you are grouping and indexing via integers rather than strings. But it insists that, when grouping by multiple categories, every combination of categories must be accounted for.

I sometimes use categories even when there's a low density of common strings, simply because those strings are long and it saves memory / improves performance. Sometimes there are thousands of categories in each column. When grouping by 3 columns, pandas forces us to hold results for 1000^3 groups.

My question: is there a convenient way to use groupby with categories while avoiding this untoward behaviour? I'm not looking for any of these solutions:

  • Recreating all the functionality via numpy.
  • Continually converting to strings/codes before groupby, reverting to categories later.
  • Making a tuple column from group columns, then group by the tuple column.

I'm hoping there's a way to modify just this particular pandas idiosyncrasy. A simple example is below: instead of the 4 group combinations I want in the output, I end up with 12.

import pandas as pd

group_cols = ['Group1', 'Group2', 'Group3']

df = pd.DataFrame([['A', 'B', 'C', 54.34],
                   ['A', 'B', 'D', 61.34],
                   ['B', 'A', 'C', 514.5],
                   ['B', 'A', 'A', 765.4],
                   ['A', 'B', 'D', 765.4]],
                  columns=(group_cols+['Value']))

for col in group_cols:
    df[col] = df[col].astype('category')

df.groupby(group_cols, as_index=False).sum()

#   Group1  Group2  Group3  Value
#   A   A   A   NaN
#   A   A   C   NaN
#   A   A   D   NaN
#   A   B   A   NaN
#   A   B   C   54.34
#   A   B   D   826.74
#   B   A   A   765.40
#   B   A   C   514.50
#   B   A   D   NaN
#   B   B   A   NaN
#   B   B   C   NaN
#   B   B   D   NaN

Bounty update

The issue is poorly addressed by the pandas development team (cf. github.com/pandas-dev/pandas/issues/17594). Therefore, I am looking for responses that address any of the following:

  1. Why, with reference to pandas source code, is categorical data treated differently in groupby operations?
  2. Why would the current implementation be preferred? I appreciate this is subjective, but I am struggling to find any answer to this question. Current behaviour is prohibitive in many situations without cumbersome, potentially expensive, workarounds.
  3. Is there a clean solution to override pandas treatment of categorical data in groupby operations? Note the 3 no-go routes (dropping down to numpy; conversions to/from codes; creating and grouping by tuple columns). I would prefer a solution that is "pandas-compliant" to minimise / avoid loss of other pandas categorical functionality.
  4. A response from the pandas development team to support and clarify the existing treatment. Also, why should considering all category combinations not be configurable as a Boolean parameter?

Bounty update #2

To be clear, I'm not expecting answers to all of the above 4 questions. The main question I am asking is whether it's possible, or advisable, to overwrite pandas library methods so that categories are treated in a way that facilitates groupby / set_index operations.

Posthaste answered 27/1, 2018 at 1:12 Comment(8)
Here is an issue with Jeff's answer, but I cannot follow his logic. – Rosiarosicrucian
@ayhan, I don't follow his logic either. I don't think it's a difficult programming task to have an argument in groupby to specify whether we want to calculate all category combinations. – Posthaste
Even df.set_index(group_cols).sum(level=[0,1,2]) returns the same. – Healing
@jp_data_analysis I think we need a parameter like ignore_missing=Boolean for missing category combinations in groupby. Someone might add that in the future. Let's try to ask for a feature request on GitHub. – Yanez
@Dark, somebody did (effectively) ask and it was pretty quickly dismissed: github.com/pandas-dev/pandas/issues/17594 – Posthaste
I don't know why Jeff closed that issue so quickly; maybe he didn't understand it? You can still comment in the issue (I think) and it could be re-opened. Perhaps if a few people commented there it would increase the odds of the issue being re-opened? I agree with Dark's suggestion that an optional argument would be nice there. – Aglaia
Since you're using the software for free, I feel you can't really complain about it. But it's open source, so you're free to make changes yourself. – Bradawl
This turned out to be my problem also: I could not understand why my pandas groupby included zero counts in the results, but had no idea it was caused by the categorical dtypes forcing all combinations, so had no idea how to search for the solution. [Vaguely hoping this comment helps someone avoid the long process I went through.] – Washcloth

Since pandas 0.23.0, the groupby method can take a parameter observed which fixes this issue if it is set to True (it is False by default). Below is exactly the code from the question with just observed=True added:

import pandas as pd

group_cols = ['Group1', 'Group2', 'Group3']

df = pd.DataFrame([['A', 'B', 'C', 54.34],
                   ['A', 'B', 'D', 61.34],
                   ['B', 'A', 'C', 514.5],
                   ['B', 'A', 'A', 765.4],
                   ['A', 'B', 'D', 765.4]],
                  columns=(group_cols+['Value']))

for col in group_cols:
    df[col] = df[col].astype('category')

df.groupby(group_cols, as_index=False, observed=True).sum()

#   Group1 Group2 Group3   Value
# 0      A      B      C   54.34
# 1      A      B      D  826.74
# 2      B      A      A  765.40
# 3      B      A      C  514.50

Heaves answered 29/5, 2018 at 8:22 Comment(2)
Wonderful... one categorical field in my 9-column groupby was blowing the result up to 21 TB... yikes. This solved the problem entirely. – Foppish
I suppose pandas must have their reasons, but I also found this groupby behaviour for categorical fields very perplexing! – Pease

I was able to get a solution that should work really well. I'll edit my post with a better explanation, but in the meantime, does this work for you?

import pandas as pd

group_cols = ['Group1', 'Group2', 'Group3']

df = pd.DataFrame([['A', 'B', 'C', 54.34],
                   ['A', 'B', 'D', 61.34],
                   ['B', 'A', 'C', 514.5],
                   ['B', 'A', 'A', 765.4],
                   ['A', 'B', 'D', 765.4]],
                  columns=(group_cols+['Value']))
for col in group_cols:
    df[col] = df[col].astype('category')


result = df.groupby([df[col].values.codes for col in group_cols]).sum()
result = result.reset_index()
level_to_column_name = {f"level_{i}":col for i,col in enumerate(group_cols)}
result = result.rename(columns=level_to_column_name)
for col in group_cols:
    result[col] = pd.Categorical.from_codes(result[col].values, categories=df[col].values.categories)
result

So the answer to this felt more like a proper programming question than a normal pandas question. Under the hood, every categorical series is just a bunch of integers that index into a list of category names. I did the groupby on these underlying integers because they don't have the same problem as categorical columns. After doing this I had to rename the columns, and I then used the from_codes constructor to efficiently turn the list of integers back into a categorical column.

Group1  Group2  Group3  Value
A       B       C       54.34
A       B       D       826.74
B       A       A       765.40
B       A       C       514.50
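
To make that concrete, here is a minimal look at the codes and categories this relies on (reusing the df built above):

cat = df['Group1'].values       # the underlying pandas.Categorical
print(cat.codes)                # integer codes, e.g. [0 0 1 1 0]
print(cat.categories)           # Index(['A', 'B'], dtype='object')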

So I understand that this isn't exactly your answer, but I've made my solution into a little function for people who have this problem in the future.

def categorical_groupby(df, group_cols, agg_function="sum"):
    "Does a groupby on a number of categorical columns via their integer codes"
    result = df.groupby([df[col].values.codes for col in group_cols]).agg(agg_function)
    result = result.reset_index()
    level_to_column_name = {f"level_{i}":col for i,col in enumerate(group_cols)}
    result = result.rename(columns=level_to_column_name)
    for col in group_cols:
        result[col] = pd.Categorical.from_codes(result[col].values, categories=df[col].values.categories)
    return result

call it like this:

df.pipe(categorical_groupby, group_cols)
Tani answered 27/1, 2018 at 2:48 Comment(10)
It works, but I'm well aware of this option: convert to codes (better than casting as strings) every time you want to groupby, then convert the codes back to categories. I've reworded the second of the 3 "routes I don't want to go down" to make it clearer. – Posthaste
I was thinking more along the lines of: "can we easily rewire the pandas group indexer for categoricals" so that all these mappings are not necessary. – Posthaste
Ah, sorry for not noticing. Well, if you have to do this sort of calculation often, you could just make a function with all these steps. It's not perfect, but I certainly have a few functions like this: workarounds that are fast but make the code confusing. – Tani
Much appreciated; upvoted your answer so others can see how to access category code mappings. For large dataframes and large numbers of groupby operations, the to-and-fro mappings can become expensive. – Posthaste
I'm curious, which step is expensive? – Tani
Getting from categories to codes is cheap. From codes to categories is relatively expensive. – Posthaste
@jp_data_analysis So I just ran a test. The only expensive step is the actual groupby. I agree with you that it is a stupid workaround. – Tani
Yep, I agree that in relative terms the groupby is more expensive. But it's a pain... as @Wen pointed out, you can't even do df.set_index(group_cols).sum(level=[0,1,2]). – Posthaste
It has to do with the way a MultiIndex is set up. Internally, it's basically just a group of categorical columns. – Tani
Agreed, but this only happens with a "categorical MultiIndex", not a "string/object-type MultiIndex", right? – Posthaste

I found the behavior similar to what's documented in the operations section of Categorical Data.

In particular, similar to

In [121]: cats2 = pd.Categorical(["a","a","b","b"], categories=["a","b","c"])

In [122]: df2 = pd.DataFrame({"cats":cats2,"B":["c","d","c","d"], "values":[1,2,3,4]})

In [123]: df2.groupby(["cats","B"]).mean()
Out[123]: 
        values
cats B        
a    c     1.0
     d     2.0
b    c     3.0
     d     4.0
c    c     NaN
     d     NaN

The documentation also describes the related behaviour for Series and groupby, and there is a pivot-table example at the end of that section.

Apart from Series.min(), Series.max() and Series.mode(), the following operations are possible with categorical data:

Series methods like Series.value_counts() will use all categories, even if some categories are not present in the data:
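
For instance, a minimal illustration of that point (my own example, not the documentation's):

import pandas as pd

s = pd.Series(pd.Categorical(["a", "b", "a"], categories=["a", "b", "c"]))
print(s.value_counts())
# a    2
# b    1
# c    0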

Groupby will also show “unused” categories:

The words and the example are cited from Categorical Data.

Vatican answered 2/2, 2018 at 6:57 Comment(2)
I appreciate this is the current treatment, and yes, it's well documented. One or two people have asked for more flexible treatment, e.g. see here, but those arguments seem to be summarily brushed aside. Nobody seems to be proposing to overwrite pandas methods such as set_index or groupby to fix this "properly" – probably for good reason. – Posthaste
@jp_data_analysis I think perhaps you can try to open an issue in the pandas repo. It might be better to get an explanation from them, or just to propose the feature once again. If enough people ask for the feature, they might consider it. – Vatican

There are a lot of questions to be answered here.
Let's start by understanding what a 'category' is...

Definition of Categorical dtype

Quoting from pandas docs for "Categorical Data":

Categoricals are a pandas data type, which correspond to categorical variables in statistics: a variable, which can take on only a limited, and usually fixed, number of possible values (categories; levels in R). Examples are gender, social class, blood types, country affiliations, observation time or ratings via Likert scales.

There are two points I want to focus on here:

  1. The definition of categoricals as a statistical variable:
    basically, this means we have to look at them from a statistical point of view, not the "regular" programming one; i.e. they are not enumerations. Statistical categorical variables have specific operations and use cases; you can read more about them on Wikipedia.
    I'll talk more about this after the second point.

  2. Categories are levels in R:
    We can understand more about categoricals if we read about R levels and factors.
    I don't know much about R, but I found this source simple and sufficient. Quoting an interesting example from it:

    When a factor is first created, all of its levels are stored along with the factor, and if subsets of the factor are extracted, they will retain all of the original levels. This can create problems when constructing model matrices and may or may not be useful when displaying the data using, say, the table function. As an example, consider a random sample from the letters vector, which is part of the base R distribution.
    
    > lets = sample(letters,size=100,replace=TRUE)
    > lets = factor(lets)
    > table(lets[1:5])
    
    a b c d e f g h i j k l m n o p q r s t u v w x y z
    1 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 1 0 1 0 0 0 0 0 0 1
    
    Even though only five of the levels were actually represented, the table function shows the frequencies for all of the levels of the original factors. To change this, we can simply use another call to factor
    
    > table(factor(lets[1:5]))
    
    a k q s z
    1 1 1 1 1
    

Basically this tells us that displaying/using all the categories, even when some are not needed, is not that uncommon. In fact, it's the default behavior!
This is due to the usual use cases of categorical variables in statistics. In almost all cases you do care about all the categories, even the unused ones. Take for example the pandas function cut.

I hope by this point you understand why this behavior exists in pandas.

GroupBy on Categorical Variables

As for why groupby considers all the combinations of categories: I can't say for sure, but my best guess, based on a quick review of the source code (and the GitHub issue you mentioned), is that they treat a groupby on categorical variables as an interaction between them. Hence, it should consider all the pairs/tuples (like a Cartesian product). AFAIK, this helps a lot when you are trying to do something like ANOVA.
This also means that in this context you can't think of it in the usual SQL-like terminology.
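
A quick sketch to verify the Cartesian-product behaviour, reusing df and group_cols from the question (on pandas versions without observed=True):

import pandas as pd

full = df.groupby(group_cols).sum().index
prod = pd.MultiIndex.from_product(
    [df[c].cat.categories for c in group_cols], names=group_cols)
print(len(full), len(prod))  # both 12: 2 * 2 * 3 category combinations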

Solutions?

Ok, but what if you don't want this behavior?
To the best of my knowledge, and taking into account that I spent the last night tracing this through the pandas source code, you can't "disable" it. It's hard-coded into every critical step.
However, because of the way groupby works, the actual "expanding" doesn't happen until it's needed, for example when calling sum over the groups or trying to print them.
Hence, you can do any of the following to get only the needed groups:

df.groupby(group_cols).indices
#{('A', 'B', 'C'): array([0]),
# ('A', 'B', 'D'): array([1, 4]),
# ('B', 'A', 'A'): array([3]),
# ('B', 'A', 'C'): array([2])}

df.groupby(group_cols).groups
#{('A', 'B', 'C'): Int64Index([0], dtype='int64'),
# ('A', 'B', 'D'): Int64Index([1, 4], dtype='int64'),
# ('B', 'A', 'A'): Int64Index([3], dtype='int64'),
# ('B', 'A', 'C'): Int64Index([2], dtype='int64')}

# an example
grouped = df.groupby(group_cols)
for g in grouped.groups:
    print(g, grouped.get_group(g)['Value'].sum())
#('A', 'B', 'C') 54.34
#('A', 'B', 'D') 826.74
#('B', 'A', 'A') 765.4
#('B', 'A', 'C') 514.5

I know this is a no-go for you, but I'm 99% sure that there is no direct way to do this.
I agree that there should be a boolean parameter to disable this behavior and use the "regular" SQL-like one.

Tumular answered 2/2, 2018 at 19:21 Comment(0)

I found this post while debugging something similar. Very good post, and I really like the inclusion of boundary conditions!

Here's the code that accomplishes the initial goal:

# wrapping 'sum' in a list forces hierarchical result columns,
# which the next line then flattens
r = df.groupby(group_cols, as_index=False).agg({'Value': ['sum']})

r.columns = ['_'.join(col).strip('_') for col in r.columns]

# (output: only the four observed group combinations, with flattened column names such as Value_sum)

The downside of this solution is that it results in a hierarchical column index that you may want to flatten (especially if you have multiple statistics). I included the flattening of the column index in the code above.
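
For example, a short sketch with several statistics (reusing df and group_cols from the question):

r2 = df.groupby(group_cols, as_index=False).agg({'Value': ['count', 'sum', 'mean']})
r2.columns = ['_'.join(col).strip('_') for col in r2.columns]
print(r2.columns.tolist())
# ['Group1', 'Group2', 'Group3', 'Value_count', 'Value_sum', 'Value_mean']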

I don't know why instance methods:

df.groupby(group_cols).sum()
df.groupby(group_cols).mean()
df.groupby(group_cols).std()

use all unique combinations of categorical variables, while the .agg() method:

df.groupby(group_cols).agg(['count', 'sum', 'mean', 'std']) 

ignores the unused level combinations of the groups. That seems inconsistent. I'm just happy that we can use the .agg() method and not have to worry about a Cartesian combination explosion.

Also, I think it is very common to have a much lower count of unique combinations than the Cartesian product. Think of all the cases where data has columns like "State", "County", "Zip"... these are all nested variables, and many data sets out there have variables with a high degree of nesting.

In our case the difference between Cartesian product of the grouping variables and the naturally occurring combinations is over 1000x (and the starting data set is over 1,000,000 rows).
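
A small sketch of how one might measure that gap on their own data (the nested column names here are hypothetical):

import numpy as np

nested_cols = ['State', 'County', 'Zip']  # hypothetical nested columns
cartesian = np.prod([df[c].nunique() for c in nested_cols])
observed = len(df[nested_cols].drop_duplicates())
print(cartesian / observed)  # >1000x in the case described above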

Consequently, I would have voted for making observed=True the default behavior.

Moonstruck answered 2/7, 2018 at 19:57 Comment(0)

The main advantages of using the categorical dtype are:

  • Memory efficiency. The data is stored as integer codes, which are much smaller than strings, so the category type needs less memory to store the same data than the object dtype (see the sketch after this list).
  • Faster processing. Categorical operations such as groupby are generally faster than the equivalent operations on object data, because they can be performed on the integer codes, which are more efficient to work with than strings.
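
A minimal sketch of the memory effect, assuming a column of long, repeated strings:

import pandas as pd

s = pd.Series(['a_fairly_long_repeated_string'] * 100_000)
print(s.memory_usage(deep=True))                     # object dtype: several MB
print(s.astype('category').memory_usage(deep=True))  # category: ~100 KB of int8 codes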

The cons are:

  • Groupby output: the output of a groupby can be very messy, with a lot of NaN rows generated depending on your category values.
  • The same problem applies to filtering.
  • Concatenation issues: the category type is tied to its dictionary of values, so when you concatenate or merge you can run into trouble and lose the category dtype (see the sketch after this list).
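
A minimal sketch of the concatenation pitfall:

import pandas as pd

a = pd.Series(['x', 'y'], dtype='category')
b = pd.Series(['y', 'z'], dtype='category')
print(pd.concat([a, b]).dtype)  # object: the category dtype is lost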

You can find more in-depth information in this article: https://medium.com/gitconnected/pandas-category-type-pros-and-cons-1bcac1bdea71

Breland answered 27/1, 2023 at 22:7 Comment(0)
