I have a dataset containing the feeding data of 3 animals: the animals' tag IDs (1, 2, 3), the feed type (A, B), and the amount (kg) of feed given at each 'meal':
Animal   FeedType  Amount(kg)
Animal1  A         10
Animal2  B         7
Animal3  A         4
Animal2  A         2
Animal1  B         5
Animal2  B         6
Animal3  A         2
In base R, I can easily produce the matrix below, which has unique(Animal) as its rows, unique(FeedType) as its columns, and the total Amount (kg) in the corresponding cells, using tapply() as follows:
out <- with(mydf, tapply(Amount, list(Animal, FeedType), sum))
        A  B
Animal1 10  5
Animal2  2 13
Animal3  6 NA
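For reference, here is the same sample data as a pandas DataFrame (a minimal sketch; I use the name mydf to mirror the R code):

import pandas as pd

# Same seven 'meal' records as the table above
mydf = pd.DataFrame({
    'Animal':   ['Animal1', 'Animal2', 'Animal3', 'Animal2',
                 'Animal1', 'Animal2', 'Animal3'],
    'FeedType': ['A', 'B', 'A', 'A', 'B', 'B', 'A'],
    'Amount':   [10, 7, 4, 2, 5, 6, 2],
})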
Is there equivalent functionality for a pandas DataFrame? What is the most elegant and fastest way to achieve this in pandas?
P.S. I want to be able to specify which column, in this case Amount, to aggregate over.
Thanks in advance.
EDIT:
I tried the approaches from both answers. Performance results with my actual pandas DataFrame of 216,347 rows and 15 columns:
import timeit

start_time1 = timeit.default_timer()
mydf.groupby(['Animal', 'FeedType'])['Amount'].sum()
elapsed_groupby = timeit.default_timer() - start_time1

start_time2 = timeit.default_timer()
# index=/columns= are the current keyword names; older pandas used rows=/cols=
mydf.pivot_table(index='Animal', columns='FeedType', values='Amount', aggfunc='sum')
elapsed_pivot = timeit.default_timer() - start_time2

print('elapsed_groupby: ' + str(elapsed_groupby))
print('elapsed_pivot: ' + str(elapsed_pivot))
gives:
elapsed_groupby: 10.172213
elapsed_pivot: 8.465783
So in my case, pivot_table() works faster.
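Note that the two calls do not return identically shaped results: groupby gives a Series with an (Animal, FeedType) MultiIndex, while pivot_table returns the matrix layout. Appending .unstack() to the groupby version (my own addition, not from either answer) reproduces the matrix:

# unstack() pivots the inner index level (FeedType) into columns,
# matching the pivot_table result and the R tapply matrix
out = mydf.groupby(['Animal', 'FeedType'])['Amount'].sum().unstack()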
Comment: I could not figure out how to specify the Amount column in the groupby() approach. I sense that grouping by the entire data frame is unnecessary and may pose performance issues. I want to specifically use df$Amount while aggregating. – Weimaraner
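For completeness, both of the following restrict the aggregation to the Amount column, the pandas analogue of R's df$Amount (the second, agg-based form is my own sketch, not from the answers):

# Select the column on the grouped object, then aggregate
mydf.groupby(['Animal', 'FeedType'])['Amount'].sum()

# Equivalent agg-based form; returns a one-column DataFrame
mydf.groupby(['Animal', 'FeedType']).agg({'Amount': 'sum'})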