Aggregate rows with information from different columns
Here some example using multiindex (for simplicity). Of course you can shorten up the redundant outputs for (shop1-3, mean/std)
by some dict comprehension, but for simplicity I skipped that.
The special thing about this answer, is that we use aggregations operating on different columns simultaneously, for instance ("all_shops", "mean")
takes the mean over all grouped rows of the columns ['shop1', 'shop2', 'shop3']
.
input / output
df:
Item shop1 shop2 shop3 Category
0 Shoes 45 50 53 Clothes
1 TV 200 300 250 Technology
2 Book 20 17 21 Books
3 phone 300 350 400 Technology
df_agg:
general all_shops shop1 shop2 shop3
count mean std mean std mean std mean std
Category
Books 1.0 19.333333 1.699673 20.0 NaN 17.0 NaN 21.0 NaN
Clothes 1.0 49.333333 3.299832 45.0 NaN 50.0 NaN 53.0 NaN
Technology 2.0 300.000000 64.549722 250.0 70.710678 325.0 35.355339 325.0 106.066017
code
import numpy as np
import pandas as pd
if __name__ == "__main__":
pd.set_option('display.max_rows', 500)
pd.set_option('display.max_columns', 500)
pd.set_option('display.width', 1000)
df = pd.DataFrame([
["Shoes", 45, 50, 53, "Clothes"],
["TV", 200, 300, 250, "Technology"],
["Book", 20, 17, 21, "Books"],
["phone", 300, 350, 400, "Technology"],
], columns=["Item", "shop1", "shop2", "shop3", "Category"]
)
print(f"df:\n{df}")
df_agg = df.groupby("Category").apply(func=lambda df_gr: pd.Series({
("general", "count"): len(df_gr),
("all_shops", "mean"): df_gr[['shop1', 'shop2', 'shop3']].mean().mean(),
("all_shops", "std"): np.std(df_gr[['shop1', 'shop2', 'shop3']].to_numpy()),
("shop1", "mean"): df_gr['shop1'].mean(),
("shop1", "std"): df_gr['shop1'].std(),
("shop2", "mean"): df_gr['shop2'].mean(),
("shop2", "std"): df_gr['shop2'].std(),
("shop3", "mean"): df_gr['shop3'].mean(),
("shop3", "std"): df_gr['shop3'].std(),
}))
print(f"\ndf_agg:\n{df_agg}")