Python Pandas: Is Order Preserved When Using groupby() and agg()?

M

6

81

I've frequently used pandas' agg() function to run summary statistics on every column of a DataFrame. For example, here's how you would produce the mean and standard deviation:

df = pd.DataFrame({'A': ['group1', 'group1', 'group2', 'group2', 'group3', 'group3'],
                   'B': [10, 12, 10, 25, 10, 12],
                   'C': [100, 102, 100, 250, 100, 102]})

>>> df
[output]
        A   B    C
0  group1  10  100
1  group1  12  102
2  group2  10  100
3  group2  25  250
4  group3  10  100
5  group3  12  102

In both of those cases, the order in which individual rows are sent to the agg function does not matter. But consider the following example:

df.groupby('A').agg([np.mean, lambda x: x.iloc[1] ])

[output]

           B             C         
        mean <lambda> mean <lambda>
A                                  
group1  11.0       12  101      102
group2  17.5       25  175      250
group3  11.0       12  101      102

In this case the lambda functions as intended, outputting the second row in each group. However, I have not been able to find anything in the pandas documentation that implies that this is guaranteed to be true in all cases. I want to use agg() along with a weighted average function, so I want to be sure that the rows that come into the function will be in the same order as they appear in the original data frame.

Does anyone know, ideally via somewhere in the docs or pandas source code, if this is guaranteed to be the case?
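For the weighted-average case mentioned above, one option that does not depend on row order at all is to hand each group's sub-frame to a function, so values and weights stay paired by row. A minimal sketch, assuming B holds the values and C the weights (that pairing and the helper name weighted_mean are illustrative, not from the question):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': ['group1', 'group1', 'group2', 'group2', 'group3', 'group3'],
                   'B': [10, 12, 10, 25, 10, 12],
                   'C': [100, 102, 100, 250, 100, 102]})


def weighted_mean(group, value_col, weight_col):
    # Hypothetical helper: values and weights come from the same rows of the sub-frame
    return np.average(group[value_col], weights=group[weight_col])


# Select only the needed columns so each call receives a two-column sub-frame
weighted = df.groupby('A')[['B', 'C']].apply(weighted_mean, value_col='B', weight_col='C')
print(weighted)
```

Because each value travels with its own weight, shuffling the rows of df before grouping should give the same result.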

Malformation answered 19/10, 2014 at 22:31 Comment(2)

Yes, I can't see any guarantees that order is preserved in the docs, so it does seem a bit unwise to rely on it. If the ordering is reflected by your B column then you could sort each group by B within the lambda to make sure. – Daisy 19/10, 2014 at 23:38

Unfortunately I want to keep the rows ordered by a column that isn't included in the aggregation. The data frame is sorted before the agg() call, so it's only a problem if it reorders it as part of the groupby(). – Malformation 20/10, 2014 at 0:2

L

42

See this enhancement issue

The short answer is yes, the groupby will preserve the orderings as passed in. You can prove this by using your example like this:

In [20]: df.sort_index(ascending=False).groupby('A').agg([np.mean, lambda x: x.iloc[1] ])
Out[20]: 
           B             C         
        mean <lambda> mean <lambda>
A                                  
group1  11.0       10  101      100
group2  17.5       10  175      100
group3  11.0       10  101      100

This is NOT true for resample however as it requires a monotonic index (it WILL work with a non-monotonic index, but will sort it first).

There is a sort= flag to groupby, but this relates to the sorting of the groups themselves and not the observations within a group.

FYI: df.groupby('A').nth(1) is a safe way to get the 2nd value of a group (as your method above will fail if a group has < 2 elements)
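A small sketch of the nth(1) suggestion on made-up data where one group has only a single row. The layout of the returned frame (its index and columns) can differ between pandas versions, so the example only inspects the B values:

```python
import pandas as pd

# Toy data (made up): group3 has a single row, so it has no second element
df = pd.DataFrame({'A': ['group1', 'group1', 'group2', 'group2', 'group3'],
                   'B': [10, 12, 10, 25, 10]})

# nth(1) picks the second row of each group and skips groups that are too short
second = df.groupby('A').nth(1)
print(second)

# B values of the second rows of group1 and group2; group3 contributes nothing
second_b = second['B'].tolist()
```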

Larrigan answered 20/10, 2014 at 12:19 Comment(3)

Thanks for the clarification and the issue link! I originally used iloc as an example because I couldn't figure out how to pass in nth() to the agg() call (because at that point x is a series). Is there some way to call nth() other than as a DataFrame member function? – Malformation 20/10, 2014 at 22:9

nth is only defined on a groupby. What do you mean 'other than a DataFrame member function'? – Larrigan 20/10, 2014 at 22:13

I meant I couldn't figure out how to pass nth() as one of the functions sent in the list to agg(). You can't do .agg([np.mean, nth]), or DataFrame.nth() or lambda x: x.nth(2). That's what led me to iloc, though it will throw index errors. The best way is probably to not try to do it all in one step; first use nth() then use agg(), then merge them. – Malformation 21/10, 2014 at 3:21
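If everything should still happen in a single agg() call, one possible workaround is a guarded function in place of the bare iloc lambda. This is only a sketch; the helper name second_or_nan and the toy data are assumptions, not something from the thread:

```python
import numpy as np
import pandas as pd

# Toy data (made up): group3 has only one row
df = pd.DataFrame({'A': ['group1', 'group1', 'group2', 'group2', 'group3'],
                   'B': [10, 12, 10, 25, 10]})


def second_or_nan(x):
    # Hypothetical helper: second element of the group, or NaN if the group is too short
    return x.iloc[1] if len(x) > 1 else np.nan


# The result has one column per (original column, function) pair
result = df.groupby('A').agg(['mean', second_or_nan])
print(result)
```

Groups with fewer than two rows then yield NaN instead of raising an index error.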

S

53

To preserve the order in which the groups first appear, you'll need to pass .groupby(..., sort=False). In your case the grouping column is already sorted, so it does not make a difference, but in general the sort=False flag is needed for that:

 df.groupby('A', sort=False).agg([np.mean, lambda x: x.iloc[1] ])
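To see what the flag changes, here is a minimal sketch on made-up data in which the group keys are not already sorted; only the order of the keys in the result differs:

```python
import pandas as pd

# Toy data (made up): group2 appears before group1
df = pd.DataFrame({'A': ['group2', 'group1', 'group2', 'group1'],
                   'B': [1, 2, 3, 4]})

# Default sort=True: group keys come out sorted
sorted_keys = df.groupby('A')['B'].sum().index.tolist()

# sort=False: group keys come out in order of first appearance
first_seen_keys = df.groupby('A', sort=False)['B'].sum().index.tolist()

print(sorted_keys)
print(first_seen_keys)
```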

Skindive answered 16/11, 2018 at 17:34 Comment(4)

There is a sort= flag to groupby, but this relates to the sorting of the groups themselves and not the observations within a group. – Hemicycle 14/8, 2019 at 8:3

They should've made this the default, considering how often it gets used. – Ardennes 9/9, 2020 at 12:58

It is ironic that the documentation also says "Get better performance by turning this off." Well, one more reason why it should have been an optional feature, not a default. Most importantly, it makes changes to data that the caller might not expect. – Pikeperch 24/12, 2021 at 12:57

As of pandas version 1.5.3, sort defaults to True, which was the opposite of my expectation. – Crumby 15/2, 2023 at 4:0

M

30

The pandas 0.19.1 docs say "groupby preserves the order of rows within each group", so this is guaranteed behavior.

http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.groupby.html

Maharaja answered 3/12, 2016 at 17:11 Comment(0)

F

5

Reference: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.groupby.html

The API accepts sort as an argument.

The description of the sort argument reads:

sort : bool, default True Sort group keys. Get better performance by turning this off. Note this does not influence the order of observations within each group. Groupby preserves the order of rows within each group.

Thus, it is clear that groupby does preserve the order of rows within each group.
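A quick way to check the quoted sentence on a toy frame (made-up data) is to collect each group's values into a list and compare them with the original row order:

```python
import pandas as pd

# Toy data (made up): rows of the two groups are interleaved and unsorted
df = pd.DataFrame({'A': ['b', 'a', 'b', 'a', 'b'],
                   'B': [5, 1, 4, 2, 3]})

# Each list holds the group's B values in the order they appear in df
rows_per_group = df.groupby('A')['B'].agg(list)
print(rows_per_group)
```

Group b comes back as 5, 4, 3, which is the order of its rows in df rather than sorted order.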

Fiertz answered 3/4, 2019 at 13:17 Comment(0)

D

3

Unfortunately the answer to this question is NO. In the past few days I've created an algorithm for non-uniform chunking and found that it cannot possibly retain order, because a groupby introduces subframes where the key to each frame is the groupby input. So you end up with:

allSubFrames = df.groupby("myColumnToOrderBy")
for orderKey, individualSubFrame in allSubFrames:
    pass  # do something...

Because it's using dictionaries you lose the ordering.
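For reference, a small sketch of what iterating over a groupby object yields on a toy frame (the column names key and val and the data are made up). On this data the keys come out sorted by default, sort=False keeps them in order of first appearance, and each sub-frame keeps the row order of the original frame:

```python
import pandas as pd

# Toy data (made up): keys are interleaved and not sorted
df = pd.DataFrame({'key': ['b', 'a', 'b', 'a'],
                   'val': [1, 2, 3, 4]})

# Default: group keys are visited in sorted order
default_order = [(k, sub['val'].tolist()) for k, sub in df.groupby('key')]

# sort=False: group keys are visited in order of first appearance
first_seen_order = [(k, sub['val'].tolist()) for k, sub in df.groupby('key', sort=False)]

print(default_order)
print(first_seen_order)
```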

If you perform a sort afterwards, as mentioned above, which I've just tested for a massive dataset, you end up with an O(n log n) computation.

However, I found that if you have, for instance, time series data that is already ordered and you want to preserve that order, it is better to change the ordering column into a list and then create a counter that records the first item in each time series. This results in an O(n) calculation.

So, essentially, if you are using a relatively small dataset the proposed answers above are reasonable, but if you are using a big dataset you need to consider avoiding groupby and sort. Instead use list(df['myColumnToOrderBy']) and iterate over it.

Deponent answered 18/6, 2021 at 11:12 Comment(2)

Can you, please, add a simple working code example to your answer? – Slot 7/2, 2022 at 17:54

No, I can't because you need a massive dataset, and it took me a long time to sort this issue out and am no longer working on that project. – Deponent 7/3, 2022 at 13:59

B

-1

Even easier:

  import numpy as np
  import pandas as pd
  pd.pivot_table(df, index='A', aggfunc=np.mean)

output:

             B    C
  A                
  group1  11.0  101
  group2  17.5  175
  group3  11.0  101

Baresark answered 15/3, 2016 at 0:5 Comment(0)
