How to resample a dataframe with different functions applied to each column?
I have a time series with temperature and radiation in a pandas DataFrame. The time resolution is one minute, in regular steps.

import datetime
import pandas as pd
import numpy as np

date_times = pd.date_range(datetime.datetime(2012, 4, 5, 8, 0),
                           datetime.datetime(2012, 4, 5, 12, 0),
                           freq='1min')
tamb = np.random.sample(date_times.size) * 10.0
radiation = np.random.sample(date_times.size) * 10.0
frame = pd.DataFrame(data={'tamb': tamb, 'radiation': radiation},
                     index=date_times)
frame
<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 241 entries, 2012-04-05 08:00:00 to 2012-04-05 12:00:00
Freq: T
Data columns:
radiation    241  non-null values
tamb         241  non-null values
dtypes: float64(2)

How can I down-sample this dataframe to a resolution of one hour, computing the hourly mean for the temperature and the hourly sum for radiation?

Hewet answered 4/4, 2012 at 23:17 Comment(0)
With pandas 0.18 the resample API changed (see the docs). So for pandas >= 0.18 the answer is:

In [31]: frame.resample('1H').agg({'radiation': np.sum, 'tamb': np.mean})
Out[31]: 
                         tamb   radiation
2012-04-05 08:00:00  5.161235  279.507182
2012-04-05 09:00:00  4.968145  290.941073
2012-04-05 10:00:00  4.478531  317.678285
2012-04-05 11:00:00  4.706206  335.258633
2012-04-05 12:00:00  2.457873    8.655838
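On current pandas (2.x), the uppercase 'H' alias and passing NumPy functions to agg are deprecated in favour of lowercase aliases and string function names; a sketch of the same aggregation, assuming pandas >= 2.2:

```python
import datetime
import numpy as np
import pandas as pd

date_times = pd.date_range(datetime.datetime(2012, 4, 5, 8, 0),
                           datetime.datetime(2012, 4, 5, 12, 0),
                           freq='1min')
frame = pd.DataFrame({'tamb': np.random.sample(date_times.size) * 10.0,
                      'radiation': np.random.sample(date_times.size) * 10.0},
                     index=date_times)

# 'h' replaces the deprecated 'H' alias; the string names 'sum' and 'mean'
# replace np.sum and np.mean as per-column aggregators
hourly = frame.resample('1h').agg({'radiation': 'sum', 'tamb': 'mean'})
print(hourly)
```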

Old Answer:

I am answering my question to reflect the time series related changes in pandas >= 0.8 (all other answers are outdated).

Using pandas >= 0.8 the answer is:

In [30]: frame.resample('1H', how={'radiation': np.sum, 'tamb': np.mean})
Out[30]: 
                         tamb   radiation
2012-04-05 08:00:00  5.161235  279.507182
2012-04-05 09:00:00  4.968145  290.941073
2012-04-05 10:00:00  4.478531  317.678285
2012-04-05 11:00:00  4.706206  335.258633
2012-04-05 12:00:00  2.457873    8.655838
Hewet answered 22/7, 2012 at 19:13 Comment(8)
This can be extended to a list of functions per column: frame.resample('1H', how={'radiation': [np.sum, np.min], 'tamb': np.mean}). The resulting DataFrame has a MultiIndex on its columns, with the original column name as level 0 and the function name as level 1. – Flavoprotein
To add to my previous comment: instead of a list of functions per column, you can also use a dictionary, where the key is the new column name and the value is the function to use: frame.resample('1H', how={'radiation': {'sum_rad': np.sum, 'min_rad': np.min}, 'tamb': np.mean}) – Flavoprotein
What if you want to add a NEW column to the result, such as a count() of the rows in each resample period? – Funerary
@codingknob: Sorry, I don't understand your comment. – Hewet
@Hewet, what if I want to aggregate by a custom function that combines data from two columns? For instance, what if I want to aggregate by np.mean(frame['radiation'] * frame['tamb'])? – Kowalczyk
What would you advise in the case of .bfill()? np.bfill() doesn't exist. – Gnatcatcher
frame.resample() no longer accepts 'how' for pandas > 1. – September
@CatherineNosova I'm afraid the solution based on agg only works for downsampling use cases (where there is something to aggregate). This answer doesn't seem to address the upsampling case, as also discussed in this question. – Seta
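Two of the comments above (a list of functions per column, and an aggregate that combines two columns) can be sketched on a modern pandas (assuming >= 2.2, hence the lowercase '1h' alias); the lambda passed to groupby(...).apply receives each hourly sub-frame, so both columns are visible at once:

```python
import datetime
import numpy as np
import pandas as pd

date_times = pd.date_range(datetime.datetime(2012, 4, 5, 8, 0),
                           datetime.datetime(2012, 4, 5, 12, 0), freq='1min')
frame = pd.DataFrame({'tamb': np.random.sample(date_times.size) * 10.0,
                      'radiation': np.random.sample(date_times.size) * 10.0},
                     index=date_times)

# Several functions per column -> MultiIndex columns:
# level 0 is the column name, level 1 the function name
multi = frame.resample('1h').agg({'radiation': ['sum', 'min'], 'tamb': 'mean'})

# A cross-column aggregate: group the DatetimeIndex into hourly bins with
# pd.Grouper, then apply over each sub-frame
cross = frame.groupby(pd.Grouper(freq='1h')).apply(
    lambda g: (g['radiation'] * g['tamb']).mean())
```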

You can also downsample using the asof method of pandas.DateRange objects (a pre-0.8 API that has since been removed).

In [21]: hourly = pd.DateRange(datetime.datetime(2012, 4, 5, 8, 0),
...                          datetime.datetime(2012, 4, 5, 12, 0),
...                          offset=pd.datetools.Hour())

In [22]: frame.groupby(hourly.asof).size()
Out[22]: 
key_0
2012-04-05 08:00:00    60
2012-04-05 09:00:00    60
2012-04-05 10:00:00    60
2012-04-05 11:00:00    60
2012-04-05 12:00:00    1
In [23]: frame.groupby(hourly.asof).agg({'radiation': np.sum, 'tamb': np.mean})
Out[23]: 
                     radiation  tamb 
key_0                                
2012-04-05 08:00:00  271.54     4.491
2012-04-05 09:00:00  266.18     5.253
2012-04-05 10:00:00  292.35     4.959
2012-04-05 11:00:00  283.00     5.489
2012-04-05 12:00:00  0.5414     9.532
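Since pd.DateRange no longer exists, a rough modern equivalent of this groupby trick is pd.Grouper, which buckets the DatetimeIndex into hourly bins (a sketch assuming pandas >= 2.2 for the lowercase '1h' alias):

```python
import datetime
import numpy as np
import pandas as pd

date_times = pd.date_range(datetime.datetime(2012, 4, 5, 8, 0),
                           datetime.datetime(2012, 4, 5, 12, 0), freq='1min')
frame = pd.DataFrame({'tamb': np.random.sample(date_times.size) * 10.0,
                      'radiation': np.random.sample(date_times.size) * 10.0},
                     index=date_times)

# pd.Grouper plays the role of DateRange.asof: it assigns every timestamp
# to its hourly bin, so groupby can size and aggregate the bins
grouped = frame.groupby(pd.Grouper(freq='1h'))
sizes = grouped.size()                                   # 60, 60, 60, 60, 1
result = grouped.agg({'radiation': 'sum', 'tamb': 'mean'})
```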
Dray answered 5/4, 2012 at 1:45 Comment(0)

To tantalize you, in pandas 0.8.0 (under heavy development in the timeseries branch on GitHub), you'll be able to do:

In [5]: frame.convert('1h', how='mean')
Out[5]: 
                     radiation      tamb
2012-04-05 08:00:00   7.840989  8.446109
2012-04-05 09:00:00   4.898935  5.459221
2012-04-05 10:00:00   5.227741  4.660849
2012-04-05 11:00:00   4.689270  5.321398
2012-04-05 12:00:00   4.956994  5.093980

The methods mentioned above are the right strategy with the current production version of pandas.

Subsequent answered 7/4, 2012 at 20:34 Comment(3)
Thanks, but what I want would be something like frame.convert('1h', how={'radiation': 'sum', 'tamb': 'mean'}). Is this an option in 0.8? – Hewet
@Wes McKinney: this should be resample in 0.8, shouldn't it? – Hewet
If you would update your answer, I would accept it. Otherwise you should remove it, I think, because it will point users in the wrong direction. – Hewet

You can use groupby, like so:

grouped = frame.groupby(lambda x: x.hour)
grouped.agg({'radiation': np.sum, 'tamb': np.mean})
# Same as: grouped.agg({'radiation': 'sum', 'tamb': 'mean'})

with the output being:

        radiation      tamb
key_0                      
8      298.581107  4.883806
9      311.176148  4.983705
10     315.531527  5.343057
11     288.013876  6.022002
12       5.527616  8.507670

So in essence I am splitting on the hour value, then calculating the mean of tamb and the sum of radiation, and returning the DataFrame (a similar approach to R's ddply). For more info, I would check the documentation page for groupby as well as this blog post.

Edit: To make this scale a bit better, you could group on both the day and the hour, like so:

grouped = frame.groupby(lambda x: (x.day, x.hour))
grouped.agg({'radiation': 'sum', 'tamb': 'mean'})
          radiation      tamb
key_0                        
(5, 8)   298.581107  4.883806
(5, 9)   311.176148  4.983705
(5, 10)  315.531527  5.343057
(5, 11)  288.013876  6.022002
(5, 12)    5.527616  8.507670
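A minimal sketch of the same idea on a modern pandas (assuming >= 2.2 for the lowercase 'h' alias): grouping on a vectorised index attribute instead of a Python lambda, where floor('h') keeps full timestamps so different days or months never collapse into the same key:

```python
import datetime
import numpy as np
import pandas as pd

date_times = pd.date_range(datetime.datetime(2012, 4, 5, 8, 0),
                           datetime.datetime(2012, 4, 5, 12, 0), freq='1min')
frame = pd.DataFrame({'tamb': np.random.sample(date_times.size) * 10.0,
                      'radiation': np.random.sample(date_times.size) * 10.0},
                     index=date_times)

# floor('h') truncates each timestamp to its hour, giving one full
# timestamp per group instead of a (day, hour) tuple key
by_hour = frame.groupby(frame.index.floor('h')).agg(
    {'radiation': 'sum', 'tamb': 'mean'})
```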
Flu answered 5/4, 2012 at 1:11 Comment(0)