How to understand the closed and label arguments in the pandas resample method?
E

4

21

Based on the pandas documentation from here: Docs

And the examples:

>>> index = pd.date_range('1/1/2000', periods=9, freq='T')
>>> series = pd.Series(range(9), index=index)
>>> series
2000-01-01 00:00:00    0
2000-01-01 00:01:00    1
2000-01-01 00:02:00    2
2000-01-01 00:03:00    3
2000-01-01 00:04:00    4
2000-01-01 00:05:00    5
2000-01-01 00:06:00    6
2000-01-01 00:07:00    7
2000-01-01 00:08:00    8
Freq: T, dtype: int64

After resampling:

>>> series.resample('3T', label='right', closed='right').sum()
2000-01-01 00:00:00     0
2000-01-01 00:03:00     6
2000-01-01 00:06:00    15
2000-01-01 00:09:00    15

As I understand it, the bins should look like this after resampling:

=========bin 01=========
2000-01-01 00:00:00    0
2000-01-01 00:01:00    1
2000-01-01 00:02:00    2

=========bin 02=========
2000-01-01 00:03:00    3
2000-01-01 00:04:00    4
2000-01-01 00:05:00    5

=========bin 03=========
2000-01-01 00:06:00    6
2000-01-01 00:07:00    7
2000-01-01 00:08:00    8

Am I right about this step?

So after .sum() I thought the result should be:

2000-01-01 00:02:00     3
2000-01-01 00:05:00    12
2000-01-01 00:08:00    21

I just do not understand where this row comes from:

2000-01-01 00:00:00 0

(because label='right', 2000-01-01 00:00:00 cannot be the right edge of any bin in this case).

And this one:

2000-01-01 00:09:00 15

(the label 2000-01-01 00:09:00 does not even exist in the original Series).

Enwreathe answered 19/1, 2018 at 11:50 Comment(4)
No, closed='right' would indicate it's the other way. - Pastel
@JohnE Thank you for replying. I understand that closed='right' means the right edge is included in the interval. Can you please tell me what the 3 bins look like after resampling with series.resample('3T', label='right', closed='right')? I thought 2000-01-01 00:00:00 0 should not appear after the .sum(). - Enwreathe
I thought I already addressed that in my answer, but I'll try to make it clearer. - Gigantopithecus
I think this is happening because you are resampling with '3T', which specifies a 3-minute period for each bin, not 3 rows. - Ops
G
32

Short answer: If you use closed='left' and loffset='2T' then you'll get what you expected:

series.resample('3T', label='left', closed='left', loffset='2T').sum()

2000-01-01 00:02:00     3
2000-01-01 00:05:00    12
2000-01-01 00:08:00    21
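
Note for newer pandas: loffset was deprecated in pandas 1.1 and removed in 2.0, so the call above fails there. A minimal sketch of an equivalent, assuming the same 9-row series from the question, is to resample without loffset and then shift the labels by hand ('3min' is used instead of '3T' because the 'T' alias is deprecated in recent releases):

```python
import pandas as pd

# Same data as in the question: one value per minute, 00:00 .. 00:08.
index = pd.date_range("2000-01-01", periods=9, freq="min")
series = pd.Series(range(9), index=index)

# Bins are [00:00, 00:03), [00:03, 00:06), [00:06, 00:09), labelled by the left edge.
result = series.resample("3min", label="left", closed="left").sum()

# Stand-in for loffset='2T': move only the labels forward by 2 minutes.
result.index = result.index + pd.Timedelta(minutes=2)

print(result)
```

The sums are unchanged by the last step; only the index is relabelled.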

Long answer (or why the results you got were correct, given the arguments you used): This may not be clear from the documentation, but open and closed in this setting are about strict vs non-strict inequality (e.g. < vs <=).

An example should make this clear. Using an interior interval from your example, this is the difference from changing the value of closed:

closed='right' =>  ( 3:00, 6:00 ]  or  3:00 <  x <= 6:00
closed='left'  =>  [ 3:00, 6:00 )  or  3:00 <= x <  6:00
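
If it helps, the same two inequalities can be checked directly with pd.Interval. This is only an illustration of the notation, not of what resample does internally; the variable names are made up for the example:

```python
import pandas as pd

start = pd.Timestamp("2000-01-01 00:03")
end = pd.Timestamp("2000-01-01 00:06")

right_closed = pd.Interval(start, end, closed="right")  # ( 3:00, 6:00 ]
left_closed = pd.Interval(start, end, closed="left")    # [ 3:00, 6:00 )

# Membership at the two edges is the only thing that differs.
edge_membership = {
    "3:00 in (3:00, 6:00]": start in right_closed,
    "6:00 in (3:00, 6:00]": end in right_closed,
    "3:00 in [3:00, 6:00)": start in left_closed,
    "6:00 in [3:00, 6:00)": end in left_closed,
}
for description, is_member in edge_membership.items():
    print(description, "->", is_member)
```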

You can find an explanation of the interval notation (parentheses vs brackets) in many places like here, for example: https://en.wikipedia.org/wiki/Interval_(mathematics)

The label parameter merely controls whether the left (3:00) or right (6:00) side is displayed, but doesn't impact the results themselves.

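Also note that you can shift the displayed labels with the loffset parameter (which should be entered as a time delta); it moves only the labels, not the bin edges. To change the starting point of the intervals themselves, use base (replaced by origin and offset in pandas 1.1).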
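
As a rough illustration of the difference between moving the bin edges and moving only the labels, here is a sketch that assumes pandas 1.1 or later (where resample accepts an offset argument) and the series from the question:

```python
import pandas as pd

index = pd.date_range("2000-01-01", periods=9, freq="min")
series = pd.Series(range(9), index=index)

# Moving the bin edges: with offset="1min" the edges sit at :01, :04, :07, ...
# so the rows are grouped differently.
shifted_edges = series.resample("3min", offset="1min").sum()

# Moving only the labels: the bins stay [00:00, 00:03), [00:03, 00:06), ...
# and the same sums are simply displayed one minute later.
shifted_labels = series.resample("3min").sum()
shifted_labels.index = shifted_labels.index + pd.Timedelta(minutes=1)

print(shifted_edges)
print(shifted_labels)
```

The first result has different sums than the plain resample, while the second has the same sums under different labels.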

Back to the example, where we change only the labeling from 'right' to 'left':

series.resample('3T', label='right', closed='right').sum()

2000-01-01 00:00:00     0
2000-01-01 00:03:00     6
2000-01-01 00:06:00    15
2000-01-01 00:09:00    15

series.resample('3T', label='left', closed='right').sum()

1999-12-31 23:57:00     0
2000-01-01 00:00:00     6
2000-01-01 00:03:00    15
2000-01-01 00:06:00    15

As you can see, the results are the same; only the index label changes. Pandas only lets you display the right or left label, but if it showed both, then it would look like this (below I'm using standard interval notation where ( on the left side means open and ] on the right side means closed):

( 1999-12-31 23:57:00, 2000-01-01 00:00:00 ]   0   # = 0
( 2000-01-01 00:00:00, 2000-01-01 00:03:00 ]   6   # = 1+2+3
( 2000-01-01 00:03:00, 2000-01-01 00:06:00 ]  15   # = 4+5+6
( 2000-01-01 00:06:00, 2000-01-01 00:09:00 ]  15   # =   7+8
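
A table like this can be rebuilt from the resample output if you want to see both edges at once. The sketch below assumes the 3-minute frequency from the question and derives the left edge by subtracting one bin width from the right label; the name both_edges is just for illustration:

```python
import pandas as pd

index = pd.date_range("2000-01-01", periods=9, freq="min")
series = pd.Series(range(9), index=index)

# Right-labelled result, as in the question.
right = series.resample("3min", label="right", closed="right").sum()

# Each bin is (label - 3 minutes, label]; store that as an IntervalIndex.
width = pd.Timedelta(minutes=3)
both_edges = pd.Series(
    right.to_numpy(),
    index=pd.IntervalIndex.from_arrays(right.index - width, right.index, closed="right"),
)
print(both_edges)

# The left-labelled result carries the same values on the left edges.
left = series.resample("3min", label="left", closed="right").sum()
print((left.index == both_edges.index.left).all())
```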

Note that the first bin (23:57:00,00:00:00] is not empty; it contains a single row, and the value in that row is zero. If you change 'sum' to 'count' this becomes more obvious:

series.resample('3T', label='left', closed='right').count()

1999-12-31 23:57:00    1
2000-01-01 00:00:00    3
2000-01-01 00:03:00    3
2000-01-01 00:06:00    2
Gigantopithecus answered 19/1, 2018 at 13:21 Comment(2)
Do you know how to assign the resampled date as a new column on the original dataframe? For example, 300*24 rows with a date column formatted like %Y-%m-%d %H (300 days of hourly data). I need to group them every 7 days, with the last day anchored to today, rolling back in steps of 7*24 hours. For various reasons I can't use resample.agg, so I need to set the resampled date column back on the original dataframe. - Hayne
@Hayne that's an interesting question, and you might want to post it as a new question to get a good and specific answer. Off the top of my head, I can see 2 alternative approaches: (1) use merge_asof to merge back to the original data, or (2) instead of resampling, use interpolate. - Gigantopithecus
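
A rough sketch of approach (1), shown on the small 3-minute example from the question rather than the 7-day case described in the comment. The column names ts, value, bin_label and bin_sum are made up for the illustration:

```python
import pandas as pd

index = pd.date_range("2000-01-01", periods=9, freq="min")
df = pd.DataFrame({"ts": index, "value": range(9)})

# Resample as in the question, then turn the result into a small lookup table.
bins = (
    df.set_index("ts")["value"]
    .resample("3min", label="right", closed="right")
    .sum()
    .rename("bin_sum")
    .reset_index()
    .rename(columns={"ts": "bin_label"})
)

# With closed='right' and label='right', a row belongs to the first label that
# is >= its timestamp, which is what direction="forward" looks up.
labelled = pd.merge_asof(
    df, bins, left_on="ts", right_on="bin_label", direction="forward"
)
print(labelled)
```

Both frames must be sorted by their key columns for merge_asof, which is already the case here.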
U
26

Per JohnE's answer I put together a little helpful infographic which should settle this issue once and for all:

[image: infographic illustrating the closed and label arguments]

Urbane answered 21/5, 2018 at 9:56 Comment(0)
I
4

It is important to understand that resampling first produces a raster, which is a sequence of instants (not periods, intervals, or durations), and that the positions of these instants do not depend on the 'label' and 'closed' parameters. They are determined by the 'freq' parameter (and by 'base', or 'origin' and 'offset' in pandas 1.1 and later). In your case, the system will produce the following raster:

2000-01-01 00:00:00
2000-01-01 00:03:00
2000-01-01 00:06:00
2000-01-01 00:09:00

Note again that at this moment there is no interpretation in terms of intervals or periods. You can shift the raster using 'base' ('offset' in pandas 1.1 and later); 'loffset' shifts only the resulting labels.

Then the system will use the 'closed' parameter in order to choose between two options:

  • (start, end]

  • [start, end)

Here start and end are two adjacent time stamps in the raster. The 'label' parameter is used to choose whether start or end is used as the representative of the interval.

In your example, if you choose closed='right' then you will get the following intervals:

( previous_interval , 2000-01-01 00:00:00] - {0}
(2000-01-01 00:00:00, 2000-01-01 00:03:00] - {1,2,3}
(2000-01-01 00:03:00, 2000-01-01 00:06:00] - {4,5,6}
(2000-01-01 00:06:00, 2000-01-01 00:09:00] - {7,8}

Note that after you aggregate the values over these intervals, the result is displayed in two versions depending on the 'label' parameter, that is, whether one and the same interval is represented by its left or right time stamp.
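
The three steps (raster, then 'closed', then 'label') can be imitated by hand. This is only a sketch of the idea under the assumptions of the question's data, not the way pandas implements it, and manual_resample_sum is a hypothetical helper written for this example:

```python
import pandas as pd

index = pd.date_range("2000-01-01", periods=9, freq="min")
series = pd.Series(range(9), index=index)

# Step 1: the raster, a plain sequence of instants. It is extended one step to
# the left so that 00:00:00 has a bin to fall into when closed="right".
raster = pd.date_range("1999-12-31 23:57", "2000-01-01 00:09", freq="3min")


def manual_resample_sum(s, raster, closed, label):
    """Hypothetical helper: bin `s` on `raster` by hand and sum each bin."""
    # Step 2: `closed` decides which edge belongs to the bin.
    #   closed="right": (start, end] -> end is the first raster instant >= t
    #   closed="left":  [start, end) -> end is the first raster instant >  t
    side = "left" if closed == "right" else "right"
    end_pos = raster.searchsorted(s.index, side=side)
    # Step 3: `label` only picks which edge is used to name the bin.
    labels = raster[end_pos] if label == "right" else raster[end_pos - 1]
    return s.groupby(labels).sum()


by_hand = manual_resample_sum(series, raster, closed="right", label="right")
built_in = series.resample("3min", label="right", closed="right").sum()
print(by_hand)
print(list(by_hand) == list(built_in))
```

Calling the helper with closed="left" and label="left" gives the three bins the question originally expected, labelled by their left edges.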

Ious answered 14/2, 2018 at 14:18 Comment(0)
D
-1

I now realize how it works, but the strange thing is still why the additional timestamp is added on the right side, which is counter-intuitive in a way. I guess this is similar to how range or iloc behave.

Duren answered 9/1, 2023 at 8:57 Comment(1)
This does not provide an answer to the question. Once you have sufficient reputation you will be able to comment on any post; instead, provide answers that don't require clarification from the asker. - From ReviewIrvinirvine
