Equivalent of R's tapply() in Python Pandas
Asked Answered
W

2

7

I have a dataset that contains the feeding data of 3 animals, consisting of the animals' tag ids (1,2,3), the type (A,B) and amount (kg) of feed given at each 'meal':

Animal   FeedType   Amount(kg)
Animal1     A         10
Animal2     B         7
Animal3     A         4
Animal2     A         2
Animal1     B         5
Animal2     B         6
Animal3     A         2

In base R, I can easily output the matrix below which has unique('Animal') as its rows, unique('FeedType') as its columns and the cumulative Amount (kg) in the corresponding cells of the matrix by using tapply() as below

out <- with(mydf, tapply(Amount, list(Animal, FeedType), sum))

         A  B
Animal1 10  5
Animal2  2 13
Animal3  6 NA

Is there an equivalent functionality for a Python Pandas dataframe? What is the most elegant and fastest way to achieve this in Pandas?

P.S. I want to be able to specify on what column, in this case Amount, to perform the aggregation.

Thanks in advance.

EDIT:

I tried both approaches in the two answers. Performance results with my actual Pandas data-frame of 216,347 rows and 15 columns:

start_time1 = timeit.default_timer()
mydf.groupby(['Animal','FeedType'])['Amount'].sum()
elapsed_groupby = timeit.default_timer() - start_time1

start_time2 = timeit.default_timer()
mydf.pivot_table(rows='Animal', cols='FeedType',values='Amount',aggfunc='sum')
elapsed_pivot = timeit.default_timer() - start_time2

print ('elapsed_groupby: ' + str(elapsed_groupby))
print ('elapsed_pivot: ' + str(elapsed_pivot))

gives:

elapsed_groupby: 10.172213
elapsed_pivot: 8.465783

So in my case, pivot_table() works faster.

Weimaraner answered 3/1, 2014 at 14:21 Comment(0)
C
6

The approach of @Zelazny7 with groupby and unstack is certainly fine, but for completeness, you can also do this directly with pivot_table (see doc) [version 0.13 and below]:

In [13]: df.pivot_table(rows='Animal', cols='FeedType', values='Amount(kg)', aggfunc='sum')
Out[13]:
FeedType   A   B
Animal
Animal1   10   5
Animal2    2  13
Animal3    6 NaN

In newer versions of Pandas (version 0.14 and latter), arguments of pivot_table have been changed:

In [13]: df.pivot_table(index='Animal', columns='FeedType', values='Amount(kg)', aggfunc='sum')
Out[13]:
FeedType   A   B
Animal
Animal1   10   5
Animal2    2  13
Animal3    6 NaN
Clere answered 3/1, 2014 at 14:52 Comment(4)
Thank you, Is this faster than groupBy()? I could not figure out how to specify the Amount column in the groupBy() approach. I sense that grouping by the entire data frame is unnecessary and may pose performance issues. I want to specifically use df$Amount while aggregating.Weimaraner
Some quick timeit results show the groupby approach to be slightly faster (2.44ms versus 3.28ms with 100 loops)Ronrona
@Zelazny7, I tested both approaches using timeit as you suggested. pivot_table() was faster in my case as I added to my question. But I am grateful for your answer anyways.Weimaraner
@Weimaraner If you are using IPython, for timings I find it easier to use %timeit (eg %timeit mydf.groupby(['Animal','FeedType'])['Amount'].sum())Clere
R
7

First I read in your data:

In [7]: df = pd.read_clipboard(sep="\s+", index_col=False)

In [8]: df
Out[8]:
    Animal FeedType  Amount(kg)
0  Animal1        A          10
1  Animal2        B           7
2  Animal3        A           4
3  Animal2        A           2
4  Animal1        B           5
5  Animal2        B           6
6  Animal3        A           2

Then I can groupby the two columns to aggregate:

In [9]: df.groupby(['Animal','FeedType']).sum()
Out[9]:
                  Amount(kg)
Animal  FeedType
Animal1 A                 10
        B                  5
Animal2 A                  2
        B                 13
Animal3 A                  6

To get it in the same format, I can unstack the dataframe:

In [10]: df.groupby(['Animal','FeedType']).sum().unstack()
Out[10]:
          Amount(kg)
FeedType           A   B
Animal
Animal1           10   5
Animal2            2  13
Animal3            6 NaN
Ronrona answered 3/1, 2014 at 14:35 Comment(3)
Thank you for your answer! How can I specify: df.groupby(['Animal','FeedType']).sum() to sum over 'Amount' specifically? I have other float columns in the same dataframe and I am only interested in the 'Amount' values.Weimaraner
Do df.groupby(['Animal','FeedType'])['Amount'].sum()Clere
Yes, sorry I figured this out after I wrote my comment. But does that mean I am grouping by all columns that are not in ['Animal', 'FeedType'] in the first place? I did not specify this in my question to keep it brief and to the point but I indeed have many columns and would like to limit the operation to a single column if possible.Weimaraner
C
6

The approach of @Zelazny7 with groupby and unstack is certainly fine, but for completeness, you can also do this directly with pivot_table (see doc) [version 0.13 and below]:

In [13]: df.pivot_table(rows='Animal', cols='FeedType', values='Amount(kg)', aggfunc='sum')
Out[13]:
FeedType   A   B
Animal
Animal1   10   5
Animal2    2  13
Animal3    6 NaN

In newer versions of Pandas (version 0.14 and latter), arguments of pivot_table have been changed:

In [13]: df.pivot_table(index='Animal', columns='FeedType', values='Amount(kg)', aggfunc='sum')
Out[13]:
FeedType   A   B
Animal
Animal1   10   5
Animal2    2  13
Animal3    6 NaN
Clere answered 3/1, 2014 at 14:52 Comment(4)
Thank you, Is this faster than groupBy()? I could not figure out how to specify the Amount column in the groupBy() approach. I sense that grouping by the entire data frame is unnecessary and may pose performance issues. I want to specifically use df$Amount while aggregating.Weimaraner
Some quick timeit results show the groupby approach to be slightly faster (2.44ms versus 3.28ms with 100 loops)Ronrona
@Zelazny7, I tested both approaches using timeit as you suggested. pivot_table() was faster in my case as I added to my question. But I am grateful for your answer anyways.Weimaraner
@Weimaraner If you are using IPython, for timings I find it easier to use %timeit (eg %timeit mydf.groupby(['Animal','FeedType'])['Amount'].sum())Clere

© 2022 - 2024 — McMap. All rights reserved.