Pandas resample with start date

Asked 9/6, 2020 at 19:14 Answered 17/6, 2020 at 14:16

Solved python pandas dataframe time-series pandas-resample

I'd like to resample a pandas object using a specific date (or month) as the edge of the first bin. For instance, in the following snippet I'd like my first index value to be 2020-02-29 and I'd be happy specifying start=2 or start="2020-02-29".

>>> dates = pd.date_range("2020-01-29", "2021-07-04")
>>> s = pd.Series(range(len(dates)), index=dates)
>>> s.resample('4M').count()
2020-01-31      3
2020-05-31    121
2020-09-30    122
2021-01-31    123
2021-05-31    120
2021-09-30     34
Freq: 4M, dtype: int64

So far this is the cleanest I can come up with uses pd.cut and groupby:

>>> rule = "4M"
>>> start = pd.Timestamp("2020-02-29") - pd.tseries.frequencies.to_offset(rule)
>>> end = s.index.max() + pd.tseries.frequencies.to_offset(rule)
>>> bins = pd.date_range(start, end, freq=rule)
>>> gb = s.groupby(pd.cut(s.index, bins)).count()
>>> gb.index = gb.index.categories.right
>>> gb
2020-02-29     32
2020-06-30    122
2020-10-31    123
2021-02-28    120
2021-06-30    122
2021-10-31      4
dtype: int64

Henriques answered 9/6, 2020 at 19:14 Comment(1)

You can use pd.cut(s.index, bins, labels=bins[1:]) to group; specifying the bins in the cut saves you the step of re-defining the index. Also since the day of the date is entirely irrelevant with a '4M' offset you can remove the ambiguity by specifying only the YM for the start: pd.Timestamp("2020-02"). Other than that your cut is pretty much the way to go. – Peerless 15/6, 2020 at 20:23

My answer feels a little hacky, but uses resample and gives the desired output. Find the date one bin length (e.g. 4 months, or month ends specifically) before the specified date, append it to s, and then resample:

rule = '4M'
date = '02-29-2020'

base_date = pd.to_datetime(date) - pd.tseries.frequencies.to_offset(rule)
s.loc[base_date] = np.nan
output = s.resample(rule=rule).count()
output=output[output.index >= date]

Result:

2020-02-29     32
2020-06-30    122
2020-10-31    123
2021-02-28    120
2021-06-30    122
2021-10-31      4
Freq: 4M, dtype: int64

I added output=output[output.index >= date] b/c otherwise you get an additional empty bin:

2019-10-31      0
2020-02-29     32
2020-06-30    122
2020-10-31    123
2021-02-28    120
2021-06-30    122
2021-10-31      4
Freq: 4M, dtype: int64

Tessera answered 12/6, 2020 at 1:3 Comment(3)

This is a neat trick and seems like the simplest code, because it uses resample. In a quickie timing test I find it to also be faster than the OP answer, but not as fast as the combo of @ALolz/MhdMedfa answers. – Jellyfish 17/6, 2020 at 14:57

output = output.loc[date:] or output[1:] would also work. – Liba 17/6, 2020 at 20:19

I like that this uses resample and the same aggregation method as my original problem statement. Thanks! – Henriques 18/6, 2020 at 13:30

All you need to use is pd.cut like below:

>>> gb = pd.cut(s.index, bins).value_counts()
>>> gb.index = gb.index.categories.right
>>> gb
2020-02-29     32
2020-06-30    122
2020-10-31    123
2021-02-28    120
2021-06-30    122
2021-10-31      4
dtype: int64

there is no need to use groupby

Toussaint answered 11/6, 2020 at 21:23 Comment(2)

Thanks for your feedback. As I understood @Henriques is looking for a cleaner way to get the same answer. Do you have a suggestion to make a better answer? Thanks again for your time – Toussaint 16/6, 2020 at 15:4

I'm just saying that you answer is potentially misleading in that it looks like you replaced 6 lines with 2, but your answer also requires 6 lines. You just shortened 1 out of the 6 lines (which is good, and it's also faster). My only suggestion is to be more clear about how your answer is different from, and improves upon, the answer in the OP. You could also show the speed increase. (I did this in my "answer", but the timing I used is a combo of your improvement and @ALollz's.) – Jellyfish 17/6, 2020 at 15:19

Another way when dealing with months intervals could be to convert the datetime index to an integer from year and month, remove the start_date defined and some modulo value with the rule. use this in a groupby.

rule = '4M'
start = "2020-02-29"

# change types of value
d = pd.Timestamp(start)
nb = int(rule[:-1])

gr = s.groupby(d+(1+((s.index.year*12+s.index.month) #convert datetime index to int
                      -(d.year*12+d.month+1))//nb) # remove start and modulo rule
                  *pd.tseries.frequencies.to_offset(rule) # get rule freq
              ).count()
print (gr)
2020-02-29     32
2020-06-30    121
2020-10-31    123
2021-02-28    120
2021-06-30    122
2021-10-31      4
dtype: int64

Now compared to your method, let's say you define a date you want not being within the first X months define by your rule like 2020-07-31 with the same rule (4M). with this method, it gives:

2020-03-31     63 #you get this interval
2020-07-31    121
2020-11-30    122
2021-03-31    121
2021-07-31     95
dtype: int64

while with your method, you get:

2020-07-31    121  #you loose info from before the 2020-03-31
2020-11-30    122
2021-03-31    121
2021-07-31     95
dtype: int64

I know you stated in the question that you define the first date but with this method you could define any date as long as the rule is in month

Branchia answered 13/6, 2020 at 2:22 Comment(0)

This is not an original answer but rather combines the improvements by @ALollz (comment) and @MhdMedf (answer) into a single answer for clarity as they represent compatible improvements. Also see below for a timing note.

rule = "4M"
start = pd.Timestamp("2020-02-29") - pd.tseries.frequencies.to_offset(rule)
end = s.index.max() + pd.tseries.frequencies.to_offset(rule)
bins = pd.date_range(start, end, freq=rule)
gb = pd.cut(s.index, bins, labels=bins[1:]).value_counts()

(The last line above replaces the last two lines of the answer in the OP. The first four lines are unchanged but included here for clarity.)

Results:

2020-02-29     32
2020-06-30    122
2020-10-31    123
2021-02-28    120
2021-06-30    122
2021-10-31      4

Speed/timing: The code in the OP takes a non-trivial amount of time considering there are only 524 rows (6ms on my machine). Using the OP data, these two improvements combine for about a 3x speedup. Of course, on a larger series/dataframe the timing results might differ substantially from those seen here.

Jellyfish answered 17/6, 2020 at 14:16 Comment(0)

Recommended topics

Hot tags