Python pandas groupby aggregate on multiple columns, then pivot

Asked 2/4, 2017 at 20:3 Answered 18/4, 2023 at 15:41

Solved python pandas dataframe pivot data-cleaning

In Python, I have a pandas DataFrame similar to the following:

Item | shop1 | shop2 | shop3 | Category
------------------------------------
Shoes| 45    | 50    | 53    | Clothes
TV   | 200   | 300   | 250   | Technology
Book | 20    | 17    | 21    | Books
phone| 300   | 350   | 400   | Technology

Where shop1, shop2 and shop3 are the costs of every item in different shops. Now, I need to return a DataFrame, after some data cleaning, like this one:

Category (index)| size| sum| mean | std
----------------------------------------

where size is the number of items in each Category and sum, mean and std are related to the same functions applied to the 3 shops. How can I do these operations with the split-apply-combine pattern (groupby, aggregate, apply,...) ?

Can someone help me out? I'm going crazy with this one...thank you!

Vtarj answered 2/4, 2017 at 20:3 Comment(0)

Edited for Pandas 0.22+ considering the deprecation of the use of dictionaries in a group by aggregation.

We set up a very similar dictionary where we use the keys of the dictionary to specify our functions and the dictionary itself to rename the columns.

rnm_cols = dict(size='Size', sum='Sum', mean='Mean', std='Std')
df.set_index(['Category', 'Item']).stack().groupby('Category') \
  .agg(rnm_cols.keys()).rename(columns=rnm_cols)

            Size   Sum        Mean        Std
Category                                     
Books          3    58   19.333333   2.081666
Clothes        3   148   49.333333   4.041452
Technology     6  1800  300.000000  70.710678

option 1
use agg ← link to docs

agg_funcs = dict(Size='size', Sum='sum', Mean='mean', Std='std')
df.set_index(['Category', 'Item']).stack().groupby(level=0).agg(agg_funcs)

                  Std   Sum        Mean  Size
Category                                     
Books        2.081666    58   19.333333     3
Clothes      4.041452   148   49.333333     3
Technology  70.710678  1800  300.000000     6

option 2
more for less
use describe ← link to docs

df.set_index(['Category', 'Item']).stack().groupby(level=0).describe().unstack()

            count        mean        std    min    25%    50%    75%    max
Category                                                                   
Books         3.0   19.333333   2.081666   17.0   18.5   20.0   20.5   21.0
Clothes       3.0   49.333333   4.041452   45.0   47.5   50.0   51.5   53.0
Technology    6.0  300.000000  70.710678  200.0  262.5  300.0  337.5  400.0

Formally answered 2/4, 2017 at 23:27 Comment(2)

thanks for your answer @piRSquared, if we want to apply multiple functions for the same column dictionary wouldn't work. Is there any way to handle this? – Khan 21/3, 2018 at 22:32

@Khan This uses groupby and aggregation on a Pandas Series. It behaves differently for a DataFrame. – Formally 21/3, 2018 at 22:48

df.groupby('Category').agg({'Item':'size','shop1':['sum','mean','std'],'shop2':['sum','mean','std'],'shop3':['sum','mean','std']})

Or if you want it across all shops then:

df1 = df.set_index(['Item','Category']).stack().reset_index().rename(columns={'level_2':'Shops',0:'costs'})
df1.groupby('Category').agg({'Item':'size','costs':['sum','mean','std']})

Convenience answered 2/4, 2017 at 20:30 Comment(0)

Edited for Pandas 0.22+ considering the deprecation of the use of dictionaries in a group by aggregation.

We set up a very similar dictionary where we use the keys of the dictionary to specify our functions and the dictionary itself to rename the columns.

rnm_cols = dict(size='Size', sum='Sum', mean='Mean', std='Std')
df.set_index(['Category', 'Item']).stack().groupby('Category') \
  .agg(rnm_cols.keys()).rename(columns=rnm_cols)

            Size   Sum        Mean        Std
Category                                     
Books          3    58   19.333333   2.081666
Clothes        3   148   49.333333   4.041452
Technology     6  1800  300.000000  70.710678

option 1
use agg ← link to docs

agg_funcs = dict(Size='size', Sum='sum', Mean='mean', Std='std')
df.set_index(['Category', 'Item']).stack().groupby(level=0).agg(agg_funcs)

                  Std   Sum        Mean  Size
Category                                     
Books        2.081666    58   19.333333     3
Clothes      4.041452   148   49.333333     3
Technology  70.710678  1800  300.000000     6

option 2
more for less
use describe ← link to docs

df.set_index(['Category', 'Item']).stack().groupby(level=0).describe().unstack()

            count        mean        std    min    25%    50%    75%    max
Category                                                                   
Books         3.0   19.333333   2.081666   17.0   18.5   20.0   20.5   21.0
Clothes       3.0   49.333333   4.041452   45.0   47.5   50.0   51.5   53.0
Technology    6.0  300.000000  70.710678  200.0  262.5  300.0  337.5  400.0

Formally answered 2/4, 2017 at 23:27 Comment(2)

thanks for your answer @piRSquared, if we want to apply multiple functions for the same column dictionary wouldn't work. Is there any way to handle this? – Khan 21/3, 2018 at 22:32

@Khan This uses groupby and aggregation on a Pandas Series. It behaves differently for a DataFrame. – Formally 21/3, 2018 at 22:48

Aggregate rows with information from different columns

Here some example using multiindex (for simplicity). Of course you can shorten up the redundant outputs for (shop1-3, mean/std) by some dict comprehension, but for simplicity I skipped that.

The special thing about this answer, is that we use aggregations operating on different columns simultaneously, for instance ("all_shops", "mean") takes the mean over all grouped rows of the columns ['shop1', 'shop2', 'shop3'].

input / output

df:
    Item  shop1  shop2  shop3    Category
0  Shoes     45     50     53     Clothes
1     TV    200    300    250  Technology
2   Book     20     17     21       Books
3  phone    300    350    400  Technology

df_agg:
           general   all_shops             shop1             shop2             shop3            
             count        mean        std   mean        std   mean        std   mean         std
Category                                                                                        
Books          1.0   19.333333   1.699673   20.0        NaN   17.0        NaN   21.0         NaN
Clothes        1.0   49.333333   3.299832   45.0        NaN   50.0        NaN   53.0         NaN
Technology     2.0  300.000000  64.549722  250.0  70.710678  325.0  35.355339  325.0  106.066017

code

import numpy as np
import pandas as pd


if __name__ == "__main__":
    pd.set_option('display.max_rows', 500)
    pd.set_option('display.max_columns', 500)
    pd.set_option('display.width', 1000)

    df = pd.DataFrame([
        ["Shoes", 45, 50, 53, "Clothes"],
        ["TV", 200, 300, 250, "Technology"],
        ["Book", 20, 17, 21, "Books"],
        ["phone", 300, 350, 400, "Technology"],
        ], columns=["Item", "shop1", "shop2", "shop3", "Category"]
    )
    print(f"df:\n{df}")

    df_agg = df.groupby("Category").apply(func=lambda df_gr: pd.Series({
        ("general", "count"): len(df_gr),
        ("all_shops", "mean"): df_gr[['shop1', 'shop2', 'shop3']].mean().mean(),
        ("all_shops", "std"): np.std(df_gr[['shop1', 'shop2', 'shop3']].to_numpy()),
        ("shop1", "mean"): df_gr['shop1'].mean(),
        ("shop1", "std"): df_gr['shop1'].std(),
        ("shop2", "mean"): df_gr['shop2'].mean(),
        ("shop2", "std"): df_gr['shop2'].std(),
        ("shop3", "mean"): df_gr['shop3'].mean(),
        ("shop3", "std"): df_gr['shop3'].std(),
    }))
    print(f"\ndf_agg:\n{df_agg}")

Hydraulic answered 18/4, 2023 at 15:41 Comment(0)

If I understand correctly, you want to calculate aggregate metrics for all shops, not for each individually. To do that, you can first stack your dataframe and then group by Category:

stacked = df.set_index(['Item', 'Category']).stack().reset_index()
stacked.columns = ['Item', 'Category', 'Shop', 'Price']
stacked.groupby('Category').agg({'Price':['count','sum','mean','std']})

Which results in

           Price                             
           count   sum        mean        std
Category                                     
Books          3    58   19.333333   2.081666
Clothes        3   148   49.333333   4.041452
Technology     6  1800  300.000000  70.710678

Palaeogene answered 2/4, 2017 at 20:40 Comment(0)

Aggregate rows with information from different columns

input / output

code

Recommended topics

Hot tags