I have a table that looks similar to this, with a multiindex on (A, B)
>>> columns = ["A", "B", "C", "D"]
... data = [
... [1, 1, 99, 22],
... [1, 2, 87, 24],
... [1, 3, 65, 31],
... [2, 1, 88, 30],
... [2, 2, 76, 33],
... [2, 3, 23, 32],
... [2, 4, 38, 28],
... [3, 1, 33, 40],
... [3, 2, 23, 41],
...]
>>>
>>> pd_table = pd.DataFrame(data=data, columns=columns)
>>> pd_table.set_index(["A", "B"], inplace=True)
>>> print(pd_table)
C D
A B
1 1 99 22
2 87 24
3 65 31
2 1 88 30
2 76 33
3 23 32
4 38 28
3 1 33 40
2 23 41
If i want to groupby the results on an index, and apply an aggregate function on the groups, I can do this by
>>> roll_table = pd_table.groupby("A").aggregate({"C": min, "D": max})
>>> print(roll_table)
C D
A
1 65 31
2 23 33
3 23 41
However, this drops the B index, which I would like to keep. I also want to apply a function to this column, but apparently this fails:
>>> roll_table = pd_table.groupby("A").aggregate({"B": max, "C": min, "D": max})
>>> print(roll_index)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "...\Python\Python38\site-packages\pandas\core\groupby\generic.py", line 928, in aggregate
result, how = self._aggregate(func, *args, **kwargs)
File "...\Python\Python38\site-packages\pandas\core\base.py", line 357, in _aggregate
raise SpecificationError("nested renamer is not supported")
pandas.core.base.SpecificationError: nested renamer is not supported
One way to solve this is to move the B index to the columns, perform the aggregation, then move it back to the index, but this seems cumbersome:
>>> roll_table = pd_table.reset_index(level="B")
>>> roll_table = roll_table.groupby("A").aggregate({"B": max, "C": min, "D": max})
>>> roll_table = roll_table.set_index("B", append=True)
>>> print(roll_table)
C D
A B
1 3 65 31
2 4 23 33
3 2 23 41
Is there a way to make aggregate work on ungrouped indexes as well?
To give a use case example, the indexes may be coordinates, and I want to use the first y value as a reference point. Or I may want to use "size"
to keep track of how many values were grouped together.
>>> columns = ["x", "y", "Pressure"]
>>> data = [
... [ 1, 1, 99],
... [ 1, 2, 98],
... [ 1, 3, 101],
... [ 2, 2, 100],
... [ 2, 3, 96],
... [ 3, 1, 100],
... [ 3, 2, 102],
... [ 3, 3, 100],
... ]
>>>
>>> pd_table = pd.DataFrame(data=data, columns=columns)
>>> pd_table.set_index(["x", "y"], inplace=True)
>>>
>>> pd_table.reset_index(level="y", inplace=True)
>>> roll_index = pd_table.groupby("x").aggregate({"y": "first", "Pressure": "mean"})
>>> roll_index.set_index("y", append=True, inplace=True)
>>>
>>> print(roll_index)
Pressure
x y
1 1 99.333333
2 2 98.000000
3 1 100.666667
B
will be different for row 2 forC
andD
? unless you are certain that the min ofC
and the max ofD
will be on the same row for all groups, keepingB
doesnt sound possible. – HaptenB
indices. In my example this is{"B": max, ...}
part. It doesn't have to correspond to any of the values inC
orD
, nor does it have to be themax
function, it could beavg
, orfirst
. – Dvinapd.Grouper()
lets you specify levels of a MultiIndex, so you can aggregate by Index Levels and columns; pandas docs are here: pandas.pydata.org/pandas-docs/stable/user_guide/… – Baritone