Passing function names as strings to Pandas GroupBy aggregate

In Pandas it is possible to tell how you want to aggregate your data by passing a string alias ('min' in the following example). From the docs, you have:

df.groupby('A').agg('min')

It is obvious what this is doing, but it really annoys me that I can't find anywhere in the docs a list of these string aliases and a description of what they do.

Does anyone know of a reference for these aliases?

Fiction answered 25/1, 2021 at 0:42 Comment(3)
In the function description for agg, it states that a function string name is acceptable.Letreece
@sammywemmy, it doesn't state which function the name refers to. It also doesn't show a list of available function names.Fiction
link for some of the functions. You probably also have to look through the API reference for pandas and numpy. If there is an aggregation function in numpy it can be used within pandas agg.Letreece

A string name can refer to any method (or attribute) of the object being operated on. Additionally, if the object has an __array__ attribute (as far as I can tell, that means you're calling agg or transform directly on a DataFrame or Series, not through groupby, resample, rolling, etc.), the string can refer to anything in numpy's module-level namespace (e.g. anything in np.__all__). Not everything that can be referenced will work, but anything in either of these namespaces can be referenced.

Examples

Here's an example dataframe:

In [9]: df = pd.DataFrame({'abc': list('aaaabbcccc'), 'data': np.random.random(size=10)})

In [10]: df
Out[10]:
  abc      data
0   a  0.800357
1   a  0.619654
2   a  0.448895
3   a  0.610645
4   b  0.985249
5   b  0.179411
6   c  0.173734
7   c  0.420767
8   c  0.789766
9   c  0.525486

DataFrame & Series methods with .agg and .transform

This can be aggregated or transformed using any DataFrame method (as long as the shape rules that apply to agg and transform are followed).

Of course, there are the aggregation methods we're all familiar with:

In [93]: df.agg("sum")
Out[93]:
abc     aaaabbcccc
data      5.553964
dtype: object

But you could really give anything in the DataFrame/Series API a whirl:

In [95]: df.transform("shift")
Out[95]:
   abc      data
0  NaN       NaN
1    a  0.800357
2    a  0.619654
3    a  0.448895
4    a  0.610645
5    b  0.985249
6    b  0.179411
7    c  0.173734
8    c  0.420767
9    c  0.789766

In [102]: df.agg("dtypes")
Out[102]:
abc      object
data    float64
dtype: object

Numpy methods with .agg and .transform

Additionally, when working directly with pandas objects, we can use numpy's module-level functions as well. Many of these don't work the way you might expect, so user beware:

In [101]: df.data.transform("expm1")
Out[101]:
0    1.226334
1    0.858285
2    0.566580
3    0.841620
4    1.678479
5    0.196512
6    0.189739
7    0.523129
8    1.202882
9    0.691281
Name: data, dtype: float64

In [103]: df.agg("rot90")
Out[103]:
array([[0.8003565068959021, 0.619653790821421, 0.44889504260755986,
        0.6106454343417287, 0.9852492020323964, 0.17941064387786554,
        0.17373389351532997, 0.42076690363942437, 0.7897663627044728,
        0.5254860156343195],
       ['a', 'a', 'a', 'a', 'b', 'b', 'c', 'c', 'c', 'c']], dtype=object)

In [107]: df.agg("meshgrid")
Out[107]:
[array(['a', 0.8003565068959021, 'a', 0.619653790821421, 'a',
        0.44889504260755986, 'a', 0.6106454343417287, 'b',
        0.9852492020323964, 'b', 0.17941064387786554, 'c',
        0.17373389351532997, 'c', 0.42076690363942437, 'c',
        0.7897663627044728, 'c', 0.5254860156343195], dtype=object)]

In [109]: df.agg("diag")
Out[109]: array(['a', 0.619653790821421], dtype=object)

Methods available to GroupBy, Window, and Resample operations

These numpy functions aren't available to GroupBy, Rolling, Expanding, Resample, etc. objects. But you can still call anything in the pandas API that is available on these objects:

In [117]: df.groupby('abc').agg("dtypes")
Out[117]:
        data
abc
a    float64
b    float64
c    float64

In [129]: df.groupby("abc").agg("ohlc")
Out[129]:
         data
         open      high       low     close
abc
a    0.800357  0.800357  0.448895  0.610645
b    0.985249  0.985249  0.179411  0.179411
c    0.173734  0.789766  0.173734  0.525486

In [137]: df.rolling(3).data.agg("quantile", 0.9)
Out[137]:
0         NaN
1         NaN
2    0.764216
3    0.617852
4    0.910328
5    0.910328
6    0.824081
7    0.372496
8    0.715966
9    0.736910
Name: data, dtype: float64

Note that the relevant part of the pandas API is that of the GroupBy, Window, or Resampler object itself, not that of the DataFrame or Series. Check the API reference for these objects to see the full set of available methods.
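
One rough way to see which string names a given object might accept is to list the public callables defined on its class. This is only a sketch, not an official list of aliases: the helper name public_methods is made up for illustration, and not every name it returns is a sensible aggregation.

```python
import pandas as pd


def public_methods(obj):
    """Return sorted public callable attribute names defined on obj's class.

    Hypothetical helper for illustration; it inspects the class rather than
    the instance so that properties are not evaluated.
    """
    cls = type(obj)
    return sorted(
        name
        for name in dir(cls)
        if not name.startswith("_") and callable(getattr(cls, name, None))
    )


df = pd.DataFrame({"abc": list("aabbc"), "data": [1.0, 2.0, 3.0, 4.0, 5.0]})

# Candidate string names for a GroupBy object and for a Rolling object
groupby_names = public_methods(df.groupby("abc"))
rolling_names = public_methods(df["data"].rolling(3))

print(groupby_names)
print(rolling_names)
```

The two lists differ, which is the point made above: the names that work depend on the object you call agg on.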

Implementation

Buried deep in the pandas internals, you can trace the handling of string aggregation operations to a couple of variations on this function, currently found at pandas.core.apply._try_aggregate_string_function:


    def _try_aggregate_string_function(self, obj, arg: str, *args, **kwargs):
        """
        if arg is a string, then try to operate on it:
        - try to find a function (or attribute) on ourselves
        - try to find a numpy function
        - raise
        """
        assert isinstance(arg, str)

        f = getattr(obj, arg, None)
        if f is not None:
            if callable(f):
                return f(*args, **kwargs)

            # people may try to aggregate on a non-callable attribute
            # but don't let them think they can pass args to it
            assert len(args) == 0
            assert len([kwarg for kwarg in kwargs if kwarg not in ["axis"]]) == 0
            return f

        f = getattr(np, arg, None)
        if f is not None and hasattr(obj, "__array__"):
            # in particular exclude Window
            return f(obj, *args, **kwargs)

        raise AttributeError(
            f"'{arg}' is not a valid function for '{type(obj).__name__}' object"
        )

Similarly, in many places in the test suite and internals, the logic getattr(obj, f) is used, where obj is the data structure and f is the string function name.
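
To make the lookup order concrete, here is a simplified standalone sketch of the same idea that can be run outside pandas. The function name resolve_string_func is hypothetical and this is not the pandas implementation; it only mirrors the three steps in the docstring above (object attribute, then numpy for array-like objects, then raise).

```python
import numpy as np
import pandas as pd


def resolve_string_func(obj, name, *args, **kwargs):
    """Simplified stand-in for the lookup order described above."""
    attr = getattr(obj, name, None)
    if attr is not None:
        # Methods are called; plain attributes (e.g. "dtypes") are returned.
        return attr(*args, **kwargs) if callable(attr) else attr

    np_func = getattr(np, name, None)
    if np_func is not None and hasattr(obj, "__array__"):
        # Only array-like objects (Series, DataFrame) reach numpy.
        return np_func(obj, *args, **kwargs)

    raise AttributeError(
        f"'{name}' is not a valid function for '{type(obj).__name__}' object"
    )


s = pd.Series([0.0, 1.0, 2.0], name="data")
total = resolve_string_func(s, "sum")           # found on the Series itself
expm1_result = resolve_string_func(s, "expm1")  # falls through to np.expm1

grouped = s.groupby([0, 0, 1])
try:
    resolve_string_func(grouped, "expm1")
    groupby_error = None
except AttributeError as exc:
    # A GroupBy object has no __array__, so there is no numpy fallback.
    groupby_error = str(exc)

print(total)
print(expm1_result.tolist())
print(groupby_error)
```

The Series resolves both names, while the grouped object only resolves names that exist on the GroupBy class, which matches the behaviour shown in the examples earlier in this answer.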

Kopeck answered 19/7, 2022 at 19:4 Comment(0)

https://cmdlinetips.com/2019/10/pandas-groupby-13-functions-to-aggregate/

This link lists 13 functions that can be used with agg. You can also pass a lambda function instead of a string. For example:

import pandas as pd

df = pd.DataFrame({"A": [1, 1, 2, 2],
    "B": [1, 2, 3, 4],
    "C": [0.362838, 0.227877, 1.267767, -0.562860]})
# Sum each column within each group of A
df.groupby('A').agg(lambda x: sum(x))
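
As a small sanity check, the sketch below compares a string alias with the equivalent lambda on the same kind of data. The variable names by_alias and by_lambda are only for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    "A": [1, 1, 2, 2],
    "B": [1, 2, 3, 4],
    "C": [0.362838, 0.227877, 1.267767, -0.562860],
})

# String alias: resolved to the GroupBy object's own sum method
by_alias = df.groupby("A").agg("sum")

# Lambda: called once per column within each group
by_lambda = df.groupby("A").agg(lambda x: x.sum())

print(by_alias)
print(by_lambda)
```

Both calls should give the same per-group sums; the string form is usually preferable when a matching method exists, since pandas can dispatch to its built-in implementation.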
Roxanneroxburgh answered 25/1, 2021 at 0:54 Comment(0)
