Resample Pandas Dataframe Without Filling in Missing Times
Resampling a dataframe can take the dataframe to either a higher or lower temporal resolution. Most of the time this is used to go to a lower resolution (e.g. resample 1-minute data to monthly values). When the dataset is sparse (for example, no data were collected in Feb-2020), the Feb-2020 row in the resampled dataframe will be filled with NaNs. The problem is that when the data record is long AND sparse, there are a lot of NaN rows, which makes the dataframe unnecessarily large and takes a lot of CPU time. For example, consider this dataframe and resample operation:

import numpy as np
import pandas as pd

freq1 = pd.date_range("20000101", periods=10, freq="S")
freq2 = pd.date_range("20200101", periods=10, freq="S")

index = np.hstack([freq1.values, freq2.values])
data = np.random.randint(0, 100, (20, 10))
cols = list("ABCDEFGHIJ")

df = pd.DataFrame(index=index, data=data, columns=cols)

# now resample to daily average
df = df.resample(rule="1D").mean()

Most of the data in this dataframe is useless and can be removed via:

df.dropna(how="all", axis=0, inplace=True)

However, this is sloppy. Is there another method to resample the dataframe that does not fill all of the data gaps with NaN (i.e. in the example above, the resulting dataframe would have only two rows)?
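To see the scale of the problem, here is a quick check rebuilding the example above (the lowercase "s" seconds alias is used here, since uppercase "S" is deprecated in recent pandas):

```python
import numpy as np
import pandas as pd

# rebuild the example: two 10-second bursts of data, 20 years apart
freq1 = pd.date_range("20000101", periods=10, freq="s")
freq2 = pd.date_range("20200101", periods=10, freq="s")
df = pd.DataFrame(
    index=np.hstack([freq1.values, freq2.values]),
    data=np.random.randint(0, 100, (20, 10)),
    columns=list("ABCDEFGHIJ"),
)

daily = df.resample(rule="1D").mean()
print(len(daily))                    # 7306 daily rows spanning 2000-2020
print(len(daily.dropna(how="all")))  # only 2 of them actually hold data
```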

Twyla answered 9/7, 2020 at 15:04

Updating my original answer with what I think is an improvement, plus updated timings.

Use groupby

There are a couple of ways you can use groupby instead of resample. In the case of daily ("1D") resampling, you can just use the date property of the DatetimeIndex:

df = df.groupby(df.index.date).mean()
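One caveat worth noting (not shown above): df.index.date returns plain datetime.date objects, so the grouped result ends up with an object-dtype index rather than a DatetimeIndex. If downstream code needs datetime semantics, pd.to_datetime converts it back — a minimal sketch:

```python
import numpy as np
import pandas as pd

# rebuild the example dataframe from the question
freq1 = pd.date_range("20000101", periods=10, freq="s")
freq2 = pd.date_range("20200101", periods=10, freq="s")
df = pd.DataFrame(
    index=np.hstack([freq1.values, freq2.values]),
    data=np.random.randint(0, 100, (20, 10)),
    columns=list("ABCDEFGHIJ"),
)

out = df.groupby(df.index.date).mean()
print(out.index.dtype)  # object: the labels are datetime.date, not datetimes

# convert back to a DatetimeIndex if downstream code expects one
out.index = pd.to_datetime(out.index)
```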

This is in fact faster than the resample for your data:

%%timeit
df.resample(rule='1D').mean().dropna()
# 2.08 ms ± 114 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

%%timeit
df.groupby(df.index.date).mean()
# 666 µs ± 15.1 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
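Both approaches should give the same numbers, just with different index types; a quick sanity check under that assumption:

```python
import numpy as np
import pandas as pd

# rebuild the example dataframe from the question
freq1 = pd.date_range("20000101", periods=10, freq="s")
freq2 = pd.date_range("20200101", periods=10, freq="s")
df = pd.DataFrame(
    index=np.hstack([freq1.values, freq2.values]),
    data=np.random.randint(0, 100, (20, 10)),
    columns=list("ABCDEFGHIJ"),
)

a = df.resample(rule="1D").mean().dropna(how="all")  # DatetimeIndex
b = df.groupby(df.index.date).mean()                 # object index of dates
assert np.allclose(a.to_numpy(), b.to_numpy())       # identical values
```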

The more general approach would be to use the floor of the timestamps to do the groupby operation:

rule = '1D'
f = df.index.floor(rule)
df.groupby(f).mean()

#                A     B     C     D     E     F     G     H     I     J
# 2000-01-01  50.5  33.5  62.7  42.4  46.7  49.2  64.0  53.3  71.0  38.0
# 2020-01-01  50.4  56.3  57.4  46.2  55.0  60.2  60.3  57.8  63.5  47.3

This will work with more irregular frequencies as well. The main snag is that, by default, the floor is calculated relative to a fixed reference date (the Unix epoch, 1970-01-01) rather than to your data, which can cause weird results (see my post):

rule = '7D'
f = df.index.floor(rule)
df.groupby(f).mean()

#                A     B     C     D     E     F     G     H     I     J
# 1999-12-30  50.5  33.5  62.7  42.4  46.7  49.2  64.0  53.3  71.0  38.0
# 2019-12-26  50.4  56.3  57.4  46.2  55.0  60.2  60.3  57.8  63.5  47.3
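Those odd start dates are not arbitrary: assuming the epoch is the reference point, both bin edges should sit a whole number of weeks after 1970-01-01, which a quick check confirms:

```python
import pandas as pd

# both "weird" bin edges above are whole multiples of 7 days from the epoch
epoch = pd.Timestamp(0)  # 1970-01-01
print((pd.Timestamp("1999-12-30") - epoch).days % 7)  # 0
print((pd.Timestamp("2019-12-26") - epoch).days % 7)  # 0
```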

The major issue is that the resampling doesn't start on the earliest timestamp in your data. However, this is fixable using the solution from the post linked above:

# custom function for flooring relative to a start date
def floor(x, freq):
    offset = x[0].ceil(freq) - x[0]
    return (x + offset).floor(freq) - offset

rule = '7D'
f = floor(df.index, rule)
df.groupby(f).mean()

#                A     B     C     D     E     F     G     H     I     J
# 2000-01-01  50.5  33.5  62.7  42.4  46.7  49.2  64.0  53.3  71.0  38.0
# 2019-12-28  50.4  56.3  57.4  46.2  55.0  60.2  60.3  57.8  63.5  47.3

# the 7-day cycle now starts from 2000-01-01
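For reuse, this pattern can be wrapped in a small helper (a sketch; resample_sparse and floor_from_start are hypothetical names, not pandas API, and mean() is hard-coded as the aggregation):

```python
import numpy as np
import pandas as pd

def floor_from_start(x, freq):
    # floor timestamps relative to the first timestamp instead of the epoch
    offset = x[0].ceil(freq) - x[0]
    return (x + offset).floor(freq) - offset

def resample_sparse(df, rule):
    # hypothetical helper: groupby-based resample that emits no empty bins
    return df.groupby(floor_from_start(df.index, rule)).mean()

# usage with the question's example data
freq1 = pd.date_range("20000101", periods=10, freq="s")
freq2 = pd.date_range("20200101", periods=10, freq="s")
df = pd.DataFrame(
    index=np.hstack([freq1.values, freq2.values]),
    data=np.random.randint(0, 100, (20, 10)),
    columns=list("ABCDEFGHIJ"),
)
print(resample_sparse(df, "7D").index.tolist())
# [Timestamp('2000-01-01 00:00:00'), Timestamp('2019-12-28 00:00:00')]
```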

Just note here that the custom floor() function is relatively slow compared to pandas.DatetimeIndex.floor(), so it is best to use the latter if you can; but both are better than the original resample (in your example):

%%timeit
df.groupby(df.index.floor('1D')).mean()
# 1.06 ms ± 6.52 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

%%timeit
df.groupby(floor(df.index, '1D')).mean()
# 1.42 ms ± 14.6 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
Vestryman answered 9/7, 2020 at 15:30 Comment(4)
While good for 1D frequencies, I really want something more flexible that allows different resample frequencies. That being said, I didn't know I could specify df.index.date in groupby, so thanks for the tip. – Twyla
@Twyla If you are still interested, see my updated answer. There is a more general approach using floor which seems to be faster. – Vestryman
In case it may help anyone: in an xarray DataArray with a time dimension, you can apply f as f = da.time.dt.floor("1D"). (It will rename your time dimension to floor, though.) – Inotropic

© 2022 - 2024 — McMap. All rights reserved.