Pandas aggregate with dynamic column names

Asked 18/9, 2019 at 13:48 Answered 18/9, 2019 at 16:54

Solved python pandas aggregate pandas-groupby

I have a script that generates a pandas data frame with a varying number of value columns. As an example, this df might be

import pandas as pd
df = pd.DataFrame({
'group': ['A', 'A', 'A', 'B', 'B'],
'group_color' : ['green', 'green', 'green', 'blue', 'blue'],
'val1': [5, 2, 3, 4, 5], 
'val2' : [4, 2, 8, 5, 7]
})

  group group_color  val1  val2
0     A       green     5     4
1     A       green     2     2
2     A       green     3     8
3     B        blue     4     5
4     B        blue     5     7

My goal is to get the grouped mean for each of the value columns. In this specific case (with 2 value columns), I can use

df.groupby('group').agg({"group_color": "first", "val1": "mean", "val2": "mean"})

      group_color      val1      val2
group                                
A           green  3.333333  4.666667
B            blue  4.500000  6.000000

but that does not work when the data frame in question has more value columns (val3, val4 etc.). Is there a way to dynamically take the mean of "all the other columns" or "all columns containing val in their names"?

Oocyte answered 18/9, 2019 at 13:48 Comment(2)

is group_color always the same for one group? – Cheliform 18/9, 2019 at 13:56

@QuangHoang: yes, that is the case, but I would still like to retain it – Oocyte 18/9, 2019 at 14:2

More easy like

df.groupby('group').agg(lambda x : x.head(1) if x.dtype=='object' else x.mean())
Out[63]: 
      group_color      val1      val2
group                                
A           green  3.333333  4.666667
B            blue  4.500000  6.000000

Derrick answered 18/9, 2019 at 14:0 Comment(2)

nice solution! Can you explain why the dtype of the non-numeric columns is object? – Oocyte 18/9, 2019 at 15:20

@Oocyte this is a dtype in panda, string and others all classified as object – Derrick 18/9, 2019 at 15:26

If your group_color is always the same within one group, you can do:

df.pivot_table(index=['group','group_color'],aggfunc='mean')

Output:

                       val1      val2
group group_color                    
A     green        3.333333  4.666667
B     blue         4.500000  6.000000

In the other case, you can build the dictionary and pass it to agg:

agg_dict = {f: 'first' if f=='group_color' else 'mean' for f in df.columns[1:]}
df.groupby('group').agg(agg_dict)

Which output:

      group_color      val1      val2
group                                
A           green  3.333333  4.666667
B            blue  4.500000  6.000000

Cheliform answered 18/9, 2019 at 14:0 Comment(1)

You're pivot_table answer is the way to go. I used almost the same thing but added a reset_index. – Watchdog 18/9, 2019 at 16:59

Unfortunately you will have to apply both aggregation functions separately (that or repeat "valn": "mean" as many times as valx columns). Groupby.agg can take a dictionary but the keys must be individual columns.

The way I'd do this is using DataFrame.filter to select the subset of the dataframe with the columns following the format of valx, aggregate with the mean, and then assign new columns with the aggregated results on the other columns:

(df.filter(regex=r'^val').groupby(df.group).mean()
   .assign(color = df.group_color.groupby(df.group).first()))

         val1      val2    color
group                           
A      3.333333  4.666667  green
B      4.500000  6.000000   blue

Recrement answered 18/9, 2019 at 13:53 Comment(0)

Per OP's comment

We can group by both 'group' and 'group_color' without the risk of there being more than one unique 'group_color' per 'group'

Consequently:

df.groupby(['group', 'group_color']).mean().reset_index(level=1)

      group_color      val1      val2
group                                
A           green  3.333333  4.666667
B            blue  4.500000  6.000000

Watchdog answered 18/9, 2019 at 16:54 Comment(0)

You can go with 2 dictionaries that you can combine like this:

df.groupby('group').agg({**{'group_color': 'first'}, **{c: 'mean' for c in df.columns if c.startswith('val')}})

In this case you have one dict with fixed aggregations and other with dynamic column selection.

Torytoryism answered 18/9, 2019 at 14:7 Comment(0)

Recommended topics

Hot tags