Get statistics for each group (such as count, mean, etc) using pandas GroupBy?
Asked Answered
B

11

796

I have a dataframe df and I use several columns from it to groupby:

df['col1','col2','col3','col4'].groupby(['col1','col2']).mean()

In the above way, I almost get the table (dataframe) that I need. What is missing is an additional column that contains number of rows in each group. In other words, I have mean but I also would like to know how many were used to get these means. For example in the first group there are 8 values and in the second one 10 and so on.

In short: How do I get group-wise statistics for a dataframe?

Blowfish answered 15/10, 2013 at 15:0 Comment(0)
W
631

On groupby object, the agg function can take a list to apply several aggregation methods at once. This should give you the result you need:

df[['col1', 'col2', 'col3', 'col4']].groupby(['col1', 'col2']).agg(['mean', 'count'])
Weathers answered 15/10, 2013 at 15:49 Comment(5)
This creates four count columns, but how to get only one? (The question asks for "an additional column" and that's what I would like too.)Taliesin
Please see my answer if you want to get only one count column per group.Unhallowed
What if I have a separate called Counts and instead of count the rows of the grouped type, I need to add along the column Counts.Potentiometer
@Taliesin result = df['col1','col2','col3','col4'].groupby(['col1', 'col2']).mean() ; counts = times.groupby(['col1', 'col2']).size() ; result['count'] = countsDeservedly
this is not working if the group by is a date and formated as datetime.Bullhead
U
1480

Quick Answer:

The simplest way to get row counts per group is by calling .size(), which returns a Series:

df.groupby(['col1','col2']).size()


Usually you want this result as a DataFrame (instead of a Series) so you can do:

df.groupby(['col1', 'col2']).size().reset_index(name='counts')


If you want to find out how to calculate the row counts and other statistics for each group continue reading below.


Detailed example:

Consider the following example dataframe:

In [2]: df
Out[2]: 
  col1 col2  col3  col4  col5  col6
0    A    B  0.20 -0.61 -0.49  1.49
1    A    B -1.53 -1.01 -0.39  1.82
2    A    B -0.44  0.27  0.72  0.11
3    A    B  0.28 -1.32  0.38  0.18
4    C    D  0.12  0.59  0.81  0.66
5    C    D -0.13 -1.65 -1.64  0.50
6    C    D -1.42 -0.11 -0.18 -0.44
7    E    F -0.00  1.42 -0.26  1.17
8    E    F  0.91 -0.47  1.35 -0.34
9    G    H  1.48 -0.63 -1.14  0.17

First let's use .size() to get the row counts:

In [3]: df.groupby(['col1', 'col2']).size()
Out[3]: 
col1  col2
A     B       4
C     D       3
E     F       2
G     H       1
dtype: int64

Then let's use .size().reset_index(name='counts') to get the row counts:

In [4]: df.groupby(['col1', 'col2']).size().reset_index(name='counts')
Out[4]: 
  col1 col2  counts
0    A    B       4
1    C    D       3
2    E    F       2
3    G    H       1


Including results for more statistics

When you want to calculate statistics on grouped data, it usually looks like this:

In [5]: (df
   ...: .groupby(['col1', 'col2'])
   ...: .agg({
   ...:     'col3': ['mean', 'count'], 
   ...:     'col4': ['median', 'min', 'count']
   ...: }))
Out[5]: 
            col4                  col3      
          median   min count      mean count
col1 col2                                   
A    B    -0.810 -1.32     4 -0.372500     4
C    D    -0.110 -1.65     3 -0.476667     3
E    F     0.475 -0.47     2  0.455000     2
G    H    -0.630 -0.63     1  1.480000     1

The result above is a little annoying to deal with because of the nested column labels, and also because row counts are on a per column basis.

To gain more control over the output I usually split the statistics into individual aggregations that I then combine using join. It looks like this:

In [6]: gb = df.groupby(['col1', 'col2'])
   ...: counts = gb.size().to_frame(name='counts')
   ...: (counts
   ...:  .join(gb.agg({'col3': 'mean'}).rename(columns={'col3': 'col3_mean'}))
   ...:  .join(gb.agg({'col4': 'median'}).rename(columns={'col4': 'col4_median'}))
   ...:  .join(gb.agg({'col4': 'min'}).rename(columns={'col4': 'col4_min'}))
   ...:  .reset_index()
   ...: )
   ...: 
Out[6]: 
  col1 col2  counts  col3_mean  col4_median  col4_min
0    A    B       4  -0.372500       -0.810     -1.32
1    C    D       3  -0.476667       -0.110     -1.65
2    E    F       2   0.455000        0.475     -0.47
3    G    H       1   1.480000       -0.630     -0.63



Footnotes

The code used to generate the test data is shown below:

In [1]: import numpy as np
   ...: import pandas as pd 
   ...: 
   ...: keys = np.array([
   ...:         ['A', 'B'],
   ...:         ['A', 'B'],
   ...:         ['A', 'B'],
   ...:         ['A', 'B'],
   ...:         ['C', 'D'],
   ...:         ['C', 'D'],
   ...:         ['C', 'D'],
   ...:         ['E', 'F'],
   ...:         ['E', 'F'],
   ...:         ['G', 'H'] 
   ...:         ])
   ...: 
   ...: df = pd.DataFrame(
   ...:     np.hstack([keys,np.random.randn(10,4).round(2)]), 
   ...:     columns = ['col1', 'col2', 'col3', 'col4', 'col5', 'col6']
   ...: )
   ...: 
   ...: df[['col3', 'col4', 'col5', 'col6']] = \
   ...:     df[['col3', 'col4', 'col5', 'col6']].astype(float)
   ...: 


Disclaimer:

If some of the columns that you are aggregating have null values, then you really want to be looking at the group row counts as an independent aggregation for each column. Otherwise you may be misled as to how many records are actually being used to calculate things like the mean because pandas will drop NaN entries in the mean calculation without telling you about it.

Unhallowed answered 26/9, 2015 at 19:34 Comment(8)
Hey, I really like your solution, particularly the last, where you use method chaining. However, since it is often necessary, to apply different aggregation functions to different columns, one could also concat the resulting data frames using pd.concat. This maybe easier to read than subsqeuent chainingPterous
Thanks for the "Including results for more statistics" bit! Since my next search was about flattening the resulting multiindex on columns, I'll link to the answer here: https://mcmap.net/q/55210/-how-to-flatten-a-hierarchical-index-in-columnsLangton
Great! Could you please give me a hint how to add isnull to this query to have it in one column as well? 'col4': ['median', 'min', 'count', 'isnull']Hetaerism
How do you broadcast the count? (I can't make transform work when using groupby(...).sizeJacktar
Instead of reset_index, another way to get a DataFrame is to use the as_index parameter: df.groupby(['col1','col2'], as_index=False).size(). I use the as_index=False as a habit for all my groupbys.Reign
If you wanted to sort descendently by a column? (counts)Groundsel
The use of join intrigued me, but don't you think the following gives you the equivalent thing in an easier way? Abbreviated code because in a comment: gb.agg(counts=('col3', "size"), col3_mean=('col3', 'mean'), col4_median=('col4', 'median')).reset_index()Marielamariele
@Hetaerism isnull is not an aggregation, it's a scalar to scalar function. Compare sum(1, 2, 3) produces one value (6). Isnull produces a value per input.Marielamariele
W
631

On groupby object, the agg function can take a list to apply several aggregation methods at once. This should give you the result you need:

df[['col1', 'col2', 'col3', 'col4']].groupby(['col1', 'col2']).agg(['mean', 'count'])
Weathers answered 15/10, 2013 at 15:49 Comment(5)
This creates four count columns, but how to get only one? (The question asks for "an additional column" and that's what I would like too.)Taliesin
Please see my answer if you want to get only one count column per group.Unhallowed
What if I have a separate called Counts and instead of count the rows of the grouped type, I need to add along the column Counts.Potentiometer
@Taliesin result = df['col1','col2','col3','col4'].groupby(['col1', 'col2']).mean() ; counts = times.groupby(['col1', 'col2']).size() ; result['count'] = countsDeservedly
this is not working if the group by is a date and formated as datetime.Bullhead
B
136

Swiss Army Knife: GroupBy.describe

Returns count, mean, std, and other useful statistics per-group.

df.groupby(['A', 'B'])['C'].describe()

           count  mean   std   min   25%   50%   75%   max
A   B                                                     
bar one      1.0  0.40   NaN  0.40  0.40  0.40  0.40  0.40
    three    1.0  2.24   NaN  2.24  2.24  2.24  2.24  2.24
    two      1.0 -0.98   NaN -0.98 -0.98 -0.98 -0.98 -0.98
foo one      2.0  1.36  0.58  0.95  1.15  1.36  1.56  1.76
    three    1.0 -0.15   NaN -0.15 -0.15 -0.15 -0.15 -0.15
    two      2.0  1.42  0.63  0.98  1.20  1.42  1.65  1.87

To get specific statistics, just select them,

df.groupby(['A', 'B'])['C'].describe()[['count', 'mean']]

           count      mean
A   B                     
bar one      1.0  0.400157
    three    1.0  2.240893
    two      1.0 -0.977278
foo one      2.0  1.357070
    three    1.0 -0.151357
    two      2.0  1.423148

Note: if you only need to compute 1 or 2 stats then it might be faster to use groupby.agg and just compute those columns otherwise you are performing wasteful computation.

describe works for multiple columns (change ['C'] to ['C', 'D']—or remove it altogether—and see what happens, the result is a MultiIndexed columned dataframe).

You also get different statistics for string data. Here's an example,

df2 = df.assign(D=list('aaabbccc')).sample(n=100, replace=True)

with pd.option_context('precision', 2):
    display(df2.groupby(['A', 'B'])
               .describe(include='all')
               .dropna(how='all', axis=1))

              C                                                   D                
          count  mean       std   min   25%   50%   75%   max count unique top freq
A   B                                                                              
bar one    14.0  0.40  5.76e-17  0.40  0.40  0.40  0.40  0.40    14      1   a   14
    three  14.0  2.24  4.61e-16  2.24  2.24  2.24  2.24  2.24    14      1   b   14
    two     9.0 -0.98  0.00e+00 -0.98 -0.98 -0.98 -0.98 -0.98     9      1   c    9
foo one    22.0  1.43  4.10e-01  0.95  0.95  1.76  1.76  1.76    22      2   a   13
    three  15.0 -0.15  0.00e+00 -0.15 -0.15 -0.15 -0.15 -0.15    15      1   c   15
    two    26.0  1.49  4.48e-01  0.98  0.98  1.87  1.87  1.87    26      2   b   15

For more information, see the documentation.


pandas >= 1.1: DataFrame.value_counts

This is available from pandas 1.1 if you just want to capture the size of every group, this cuts out the GroupBy and is faster.

df.value_counts(subset=['col1', 'col2'])

Minimal Example

# Setup
np.random.seed(0)
df = pd.DataFrame({'A' : ['foo', 'bar', 'foo', 'bar',
                          'foo', 'bar', 'foo', 'foo'],
                   'B' : ['one', 'one', 'two', 'three',
                          'two', 'two', 'one', 'three'],
                   'C' : np.random.randn(8),
                   'D' : np.random.randn(8)})

df.value_counts(['A', 'B']) 

A    B    
foo  two      2
     one      2
     three    1
bar  two      1
     three    1
     one      1
dtype: int64

Other Statistical Analysis Tools

If you didn't find what you were looking for above, the User Guide has a comprehensive listing of supported statical analysis, correlation, and regression tools.

Brandebrandea answered 7/4, 2019 at 22:38 Comment(1)
By doing .describe()[['count', 'mean']] you compute statistics that you would drop afterwards. Using .agg(['count', 'mean'] is a better option, about 7 times faster, as you only compute the ones actually neededPaton
T
19

To get multiple stats, collapse the index, and retain column names:

df = df.groupby(['col1','col2']).agg(['mean', 'count'])
df.columns = [ ' '.join(str(i) for i in col) for col in df.columns]
df.reset_index(inplace=True)
df

Produces:

**enter image description here**

Tabithatablature answered 13/11, 2019 at 1:31 Comment(1)
New users: warning this inserts spaces in column names. I think that should be avoided, generally speaking, so replace line 2 df.columns = [ '_'.join(str(i) for i in col) for col in df.columns] .Dastardly
C
12

We can easily do it by using groupby and count. But, we should remember to use reset_index().

df[['col1','col2','col3','col4']].groupby(['col1','col2']).count().\
reset_index()
Cacodemon answered 27/11, 2017 at 9:17 Comment(0)
A
6

Please try this code

new_column=df[['col1', 'col2', 'col3', 'col4']].groupby(['col1', 'col2']).count()
df['count_it']=new_column
df

I think that code will add a column called 'count it' which count of each group

Amaliaamalie answered 8/2, 2020 at 1:34 Comment(0)
N
4

Create a group object and call methods like below example:

grp = df.groupby(['col1',  'col2',  'col3']) 

grp.max() 
grp.mean() 
grp.describe() 
Neral answered 11/4, 2019 at 14:5 Comment(0)
S
3

If you are familiar with tidyverse R packages, here is a way to do it in python:

from datar.all import tibble, rnorm, f, group_by, summarise, mean, n, rep

df = tibble(
  col1=rep(['A', 'B'], 5), 
  col2=rep(['C', 'D'], each=5), 
  col3=rnorm(10), 
  col4=rnorm(10)
)
df >> group_by(f.col1, f.col2) >> summarise(
  count=n(),
  col3_mean=mean(f.col3), 
  col4_mean=mean(f.col4)
)
  col1 col2  n  mean_col3  mean_col4
0    A    C  3  -0.516402   0.468454
1    A    D  2  -0.248848   0.979655
2    B    C  2   0.545518  -0.966536
3    B    D  3  -0.349836  -0.915293
[Groups: ['col1'] (n=2)]

I am the author of the datar package. Please feel free to submit issues if you have any questions about using it.

Smitt answered 29/4, 2021 at 17:51 Comment(0)
M
0

Another alternative:

import pandas as pd
import numpy as np

df = pd.DataFrame({'A' : ['foo', 'bar', 'foo', 'bar',
                          'foo', 'bar', 'foo', 'foo'],
                   'B' : ['one', 'one', 'two', 'three',
                          'two', 'two', 'one', 'three'],
                   'C' : np.random.randn(8),
                   'D' : np.random.randn(8)})
df

    A   B       C           D
0   foo one   0.808197   2.057923
1   bar one   0.330835  -0.815545
2   foo two  -1.664960  -2.372025
3   bar three 0.034224   0.825633
4   foo two   1.131271  -0.984838
5   bar two   2.961694  -1.122788
6   foo one   -0.054695  0.503555
7   foo three 0.018052  -0.746912

pd.crosstab(df.A, df.B).stack().reset_index(name='count')

Output:

    A   B     count
0   bar one     1
1   bar three   1
2   bar two     1
3   foo one     2
4   foo three   1
5   foo two     2
Michaelamichaele answered 5/1, 2023 at 18:28 Comment(0)
S
0

pivot_table with specific aggfuncs

For a dataframe of aggregate statistics, pivot_table can be used as well. It produces a table not too dissimilar from Excel pivot table. The basic idea is to pass in the columns to be aggregated as values= and grouper columns as index= and whatever aggregator functions as aggfunc= (all of the optimized functions that are admissible for groupby.agg are OK).

One advantage of pivot_table over groupby.agg is that for multiple columns it produces a single size column whereas groupby.agg which creates a size column for each column (all except one are redundant).

agg_df = df.pivot_table(
    values=['col3', 'col4', 'col5'], 
    index=['col1', 'col2'], 
    aggfunc=['size', 'mean', 'median']
).reset_index()
# flatten the MultiIndex column (should be omitted if MultiIndex is preferred)
agg_df.columns = [i if not j else f"{j}_{i}" for i,j in agg_df.columns]

res1

Use named aggregation for custom column names

For custom column names, instead of multiple rename calls, use named aggregation from the beginning.

From the docs:

To support column-specific aggregation with control over the output column names, pandas accepts the special syntax in GroupBy.agg(), known as “named aggregation”, where

  • The keywords are the output column names
  • The values are tuples whose first element is the column to select and the second element is the aggregation to apply to that column. pandas provides the pandas.NamedAgg namedtuple with the fields ['column', 'aggfunc'] to make it clearer what the arguments are. As usual, the aggregation can be a callable or a string alias.

As an example, to produce aggregate dataframe where each of col3, col4 and col5 has its mean and count computed, the following code could be used. Note that it does the renaming columns step as part of groupby.agg.

aggfuncs = {f'{c}_{f}': (c, f) for c in ['col3', 'col4', 'col5'] for f in ['mean', 'count']}
agg_df = df.groupby(['col1', 'col2'], as_index=False).agg(**aggfuncs)

res3

Another use case of named aggregation is if each column needs a different aggregator function. For example, if only the mean of col3, median of col4 and min of col5 are needed with custom column names, it can be done using the following code.

agg_df = df.groupby(['col1', 'col2'], as_index=False).agg(col3_mean=('col3', 'mean'), col4_median=('col4', 'median'), col5_min=('col5', 'min'))
# or equivalently,
agg_df = df.groupby(['col1', 'col2'], as_index=False).agg(**{'_'.join(p): p for p in [('col3', 'mean'), ('col4', 'median'), ('col5', 'min')]})

res2

Shaduf answered 20/3, 2023 at 6:57 Comment(0)
D
0
df_group = (df.groupby(['city','gender'])[['age',"grade"]]
     .agg([('average','mean'),('freq','count')])
     .reset_index())

city and gender are the groups columns.

age and grade are the the depend columns (which I calculate mean on them).

the name of the mean column would be average the name of the count column would be freq

enter image description here

Deprave answered 18/6, 2023 at 13:51 Comment(0)

© 2022 - 2024 — McMap. All rights reserved.