Pandas: resample timeseries with groupby

Given the below pandas DataFrame:

In [115]: times = pd.to_datetime(pd.Series(['2014-08-25 21:00:00','2014-08-25 21:04:00',
                                            '2014-08-25 22:07:00','2014-08-25 22:09:00']))
          locations = ['HK', 'LDN', 'LDN', 'LDN']
          event = ['foo', 'bar', 'baz', 'qux']
          df = pd.DataFrame({'Location': locations,
                             'Event': event}, index=times)
          df
Out[115]:
                               Event Location
          2014-08-25 21:00:00  foo   HK
          2014-08-25 21:04:00  bar   LDN
          2014-08-25 22:07:00  baz   LDN
          2014-08-25 22:09:00  qux   LDN

I would like to resample the data, aggregating hourly by count while grouping by location, to produce a DataFrame that looks like this:

Out[115]:
                               HK    LDN
          2014-08-25 21:00:00  1     1
          2014-08-25 22:00:00  0     2

I've tried various combinations of resample() and groupby() but with no luck. How would I go about this?

Naidanaiditch answered 14/8, 2015 at 14:4 Comment(1)
For those coming to this question in 2017+, pd.TimeGrouper is deprecated. See this answer for the latest syntax.Denicedenie

In my original post, I suggested using pd.TimeGrouper. TimeGrouper is now deprecated in favor of pd.Grouper; the syntax is largely the same, so just use pd.Grouper instead.

Moreover, while pd.TimeGrouper could only group by DatetimeIndex, pd.Grouper can also group by datetime columns, which you specify through its key parameter.
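For example (a sketch; the column name time is hypothetical, not from the question):

import pandas as pd

# Same data as the question, but with the timestamps in a regular
# column named 'time' instead of in the index.
df2 = pd.DataFrame({
    'time': pd.to_datetime(['2014-08-25 21:00:00', '2014-08-25 21:04:00',
                            '2014-08-25 22:07:00', '2014-08-25 22:09:00']),
    'Location': ['HK', 'LDN', 'LDN', 'LDN'],
    'Event': ['foo', 'bar', 'baz', 'qux']})

# key= tells pd.Grouper which datetime column to bin; freq='1H' bins by hour
df2.groupby([pd.Grouper(key='time', freq='1H'), 'Location'])['Event'].count()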


Back to the question: you could use pd.Grouper to group the DatetimeIndexed DataFrame by hour:

grouper = df.groupby([pd.Grouper(freq='1H'), 'Location'])

use count to count the number of events in each group:

grouper['Event'].count()
#                      Location
# 2014-08-25 21:00:00  HK          1
#                      LDN         1
# 2014-08-25 22:00:00  LDN         2
# Name: Event, dtype: int64

use unstack to move the Location index level to a column level:

grouper['Event'].count().unstack()
# Out[49]: 
# Location             HK  LDN
# 2014-08-25 21:00:00   1    1
# 2014-08-25 22:00:00 NaN    2

and then use fillna to change the NaNs into zeros.


Putting it all together,

grouper = df.groupby([pd.Grouper(freq='1H'), 'Location'])
result = grouper['Event'].count().unstack('Location').fillna(0)

yields

Location             HK  LDN
2014-08-25 21:00:00   1    1
2014-08-25 22:00:00   0    2
Cribbing answered 14/8, 2015 at 14:10 Comment(2)
If you have your DatetimeIndex in a MultiIndex, use pd.Grouper(level=datetime_level_name, freq='1H')Leopard
In version 0.24.1 I had to write the parameter freq: grouper = df.groupby([pd.Grouper(freq='1H'), 'Location'])Nazarene

Pandas 0.21 answer: TimeGrouper is getting deprecated

There are two options for doing this, and they can give different results depending on your data. The first option groups by Location and, within each location, by hour. The second option groups by Location and hour at the same time. (A sketch of how they can differ follows the output below.)

Option 1: Use groupby + resample

grouped = df.groupby('Location').resample('H')['Event'].count()

Option 2: Group both the location and DatetimeIndex together with groupby(pd.Grouper)

grouped = df.groupby(['Location', pd.Grouper(freq='H')])['Event'].count()

They both will result in the following:

Location                     
HK        2014-08-25 21:00:00    1
LDN       2014-08-25 21:00:00    1
          2014-08-25 22:00:00    2
Name: Event, dtype: int64

And then reshape:

grouped.unstack('Location', fill_value=0)

Will output

Location             HK  LDN
2014-08-25 21:00:00   1    1
2014-08-25 22:00:00   0    2
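To see where the two options diverge: a minimal sketch with made-up data in which HK has an event at 21:00 and at 23:05 but nothing in the 22:00 hour. Option 1 resamples within each group, so it gap-fills the empty hour with a count of 0; Option 2 only emits (Location, hour) bins that actually contain data.

import pandas as pd

gap_times = pd.to_datetime(['2014-08-25 21:00:00',   # HK
                            '2014-08-25 21:30:00',   # LDN
                            '2014-08-25 23:05:00'])  # HK -- the 22:00 hour is empty
gap_df = pd.DataFrame({'Location': ['HK', 'LDN', 'HK'],
                       'Event': ['a', 'b', 'c']}, index=gap_times)

# Option 1: resample gap-fills within each group -> HK gets a 22:00 row with 0
gap_df.groupby('Location').resample('H')['Event'].count()

# Option 2: pd.Grouper emits only bins that contain data -> no 22:00 row for HK
gap_df.groupby(['Location', pd.Grouper(freq='H')])['Event'].count()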
Denicedenie answered 27/8, 2016 at 22:34 Comment(7)
Sorry, I know this is an old question. How would you do this for a multiple column groupby now that TimeGrouper is getting deprecated? Also, would it be possible to specify start and end dates in addition to the intervals?Gibe
@Gibe TimeGrouper did nothing that Grouper couldn't do. Just use a list as done with option 2. You can only specify dates at regular intervals with the date offset aliases. If you have non-regular dates you will have to do some preprocessing and simply label each with each group it belongs to and then use groupby with those groups.Denicedenie
That makes sense. So just to clarify on the non-regular dates: I have a desired date range of 2004-01-01 to 2018-12-31 with 30 day intervals. There are no guarantees that there will be values for every groupby key for every time interval. Would this be a "non-regular date" problem or would I be set with this approach?Gibe
30 day intervals are regular dates. If some of your groupby keys have different dates then each option in this answer will give you different results. It will depend on what result you desire. If you are still confused just go ahead and create a new question.Denicedenie
Thanks for the tips. I did end up creating a new question: #46612126Gibe
@TedPetrou Can you give an example as to how the Option 1 and Option 2 that you present would give different results for the same source data? As long as they are parameterized equivalently (offset aliases, etc.), how would that be the case?Langan
Note that Option 1 interpolates/gap-fills while Option 2 doesn't.Octane

Multiple Column Group By

unutbu is spot on with his answer, but I wanted to add what you could do if you had a third column, say Cost, and wanted to aggregate it like above. It was through combining unutbu's answer and this one that I found out how to do this, so I thought I would share for future users.

Create a DataFrame with Cost column:

In[1]:
import pandas as pd
times = pd.to_datetime([
    "2014-08-25 21:00:00", "2014-08-25 21:04:00",
    "2014-08-25 22:07:00", "2014-08-25 22:09:00"
])
df = pd.DataFrame({
    "Location": ["HK", "LDN", "LDN", "LDN"],
    "Event":    ["foo", "bar", "baz", "qux"],
    "Cost":     [20, 24, 34, 52]
}, index = times)
df

Out[1]:
                     Location  Event  Cost
2014-08-25 21:00:00        HK    foo    20
2014-08-25 21:04:00       LDN    bar    24
2014-08-25 22:07:00       LDN    baz    34
2014-08-25 22:09:00       LDN    qux    52

Now we group, using agg to specify each column's aggregation method by name, e.g. "size", "mean", "sum", etc.

In[2]:
grp = (df.groupby([pd.Grouper(freq="1H"), "Location"])
         .agg({"Event": "size", "Cost": "mean"}))
grp

Out[2]:
                               Event  Cost
                     Location
2014-08-25 21:00:00  HK            1    20
                     LDN           1    24
2014-08-25 22:00:00  LDN           2    43

Finally, unstack, fill the NaNs with zeros, and display as int because it's nice.

In[3]: 
grp.unstack().fillna(0).astype(int)

Out[3]:
                    Event     Cost
Location               HK LDN   HK LDN
2014-08-25 21:00:00     1   1   20  24
2014-08-25 22:00:00     0   2    0  43
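On pandas 0.25 or later, named aggregation is a tidier way to spell the same thing (a sketch, equivalent to the agg dict above):

grp = df.groupby([pd.Grouper(freq="1H"), "Location"]) \
        .agg(Event=("Event", "size"), Cost=("Cost", "mean"))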
Alta answered 15/4, 2016 at 10:19 Comment(0)

If you want to keep all the columns

# Note: on="date" assumes your timestamps live in a column named "date";
# with a DatetimeIndex (as in the question), drop the on= argument.
df = (df.groupby("Location")
      .resample("H", on="date")
      .last()
      .reset_index(drop=True))

Retain answered 20/5, 2021 at 16:7 Comment(0)

This can be done without using resample or Grouper as follows:

df.groupby([df.index.floor("1H"), "Location"]).count()
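If you also want the wide layout from the question, select a column and unstack (building on the line above):

(df.groupby([df.index.floor("1H"), "Location"])["Event"]
   .count()
   .unstack(fill_value=0))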

Melquist answered 17/9, 2019 at 21:16 Comment(2)
@Sitology I disagree - the other solutions use Grouper - this one does not. Can you point to the answer that this one duplicates?Melquist
I noticed and tried to remove down vote but system doesn't allow to change 🤷🏻‍♂️Sitology

df.resample() seems to be many times faster than the equivalent df.groupby() here:

df = 

PJMW_MW
Datetime    
2002-04-01 01:00:00     4374.0
2002-04-01 02:00:00     4306.0
2002-04-01 03:00:00     4322.0
2002-04-01 04:00:00     4359.0
2002-04-01 05:00:00     4436.0
...     ...
2017-12-31 19:00:00     8205.0
2017-12-31 20:00:00     8053.0
2017-12-31 21:00:00     8012.0
2017-12-31 22:00:00     7864.0
2017-12-31 23:00:00     7710.0

138066 rows × 1 columns

%timeit df.resample(rule='24H').mean()
3.45 ms ± 41.9 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

%timeit df.groupby(df.index.strftime('%Y-%m-%d')).mean()
169 ms ± 1.09 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
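For what it's worth, much of that gap is likely the per-row string formatting in strftime rather than groupby itself; grouping on a floored DatetimeIndex avoids the string conversion (an untimed sketch, same daily bins):

# Same daily grouping without building a string key for every row
df.groupby(df.index.floor('D')).mean()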
Pruett answered 15/6, 2023 at 11:16 Comment(0)
