I've frequently used pandas' agg() function to run summary statistics on every column of a DataFrame. For example, here's how you would produce the mean and standard deviation:
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': ['group1', 'group1', 'group2', 'group2', 'group3', 'group3'],
                   'B': [10, 12, 10, 25, 10, 12],
                   'C': [100, 102, 100, 250, 100, 102]})
>>> df
[output]
        A   B    C
0  group1  10  100
1  group1  12  102
2  group2  10  100
3  group2  25  250
4  group3  10  100
5  group3  12  102
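A minimal sketch of that kind of call, assuming the statistics are computed per group of A and using the string aliases 'mean' and 'std' (the variable name summary is just for illustration):

```python
import pandas as pd

df = pd.DataFrame({'A': ['group1', 'group1', 'group2', 'group2', 'group3', 'group3'],
                   'B': [10, 12, 10, 25, 10, 12],
                   'C': [100, 102, 100, 250, 100, 102]})

# Assumption: one mean and one (sample) standard deviation per group of A,
# for every remaining column. The result has MultiIndex columns such as
# ('B', 'mean') and ('B', 'std').
summary = df.groupby('A').agg(['mean', 'std'])
print(summary)
```

Neither statistic depends on the order of the rows inside a group.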
In both of those cases, the order in which individual rows are passed to the aggregation function does not matter. But consider the following example:
df.groupby('A').agg([np.mean, lambda x: x.iloc[1]])
[output]
           B             C
        mean <lambda> mean <lambda>
A
group1  11.0       12  101      102
group2  17.5       25  175      250
group3  11.0       12  101      102
In this case the lambda functions as intended, outputting the second row in each group. However, I have not been able to find anything in the pandas documentation that implies that this is guaranteed to be true in all cases. I want to use agg() along with a weighted average function, so I want to be sure that the rows that come into the function will be in the same order as they appear in the original data frame.
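For concreteness, a minimal sketch of the kind of weighted average I have in mind, assuming B holds the values and C holds the weights (that choice, and the helper name weighted_mean_of_b, are purely illustrative). Looking the weights up by index label should make the result independent of row order within a group, whereas a purely positional version would rely on it:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': ['group1', 'group1', 'group2', 'group2', 'group3', 'group3'],
                   'B': [10, 12, 10, 25, 10, 12],
                   'C': [100, 102, 100, 250, 100, 102]})

def weighted_mean_of_b(values):
    """Hypothetical helper: average of B weighted by C.

    `values` is the Series of B for one group. The weights are fetched
    from the original frame by index label, so they stay matched to the
    right rows regardless of the order in which the rows arrive.
    """
    weights = df.loc[values.index, 'C']
    return np.average(values, weights=weights)

weighted_b = df.groupby('A')['B'].agg(weighted_mean_of_b)
print(weighted_b)
```

This sidesteps the ordering issue for this particular function, but it does not answer the general question.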
Does anyone know, ideally via somewhere in the docs or pandas source code, if this is guaranteed to be the case?
B column then you could sort each group by B within the lambda to make sure. – Daisy
agg() call, so it's only a problem if it reorders it as part of the groupby(). – Malformation