Resample Pandas Dataframe Without Filling in Missing Times
Resampling a dataframe can take the dataframe to either a higher or lower temporal resolution. Most of the time this is used to go to a lower resolution (e.g. resample 1-minute data to monthly values). When the dataset is sparse (for example, no data were collected in Feb-2020), the Feb-2020 row in the resampled dataframe will be filled with NaNs. The problem is that when the data record is long AND sparse, there are a lot of NaN rows, which makes the dataframe unnecessarily large and takes a lot of CPU time. For example, consider this dataframe and resample operation:

import numpy as np
import pandas as pd

freq1 = pd.date_range("20000101", periods=10, freq="S")
freq2 = pd.date_range("20200101", periods=10, freq="S")

index = np.hstack([freq1.values, freq2.values])
data = np.random.randint(0, 100, (20, 10))
cols = list("ABCDEFGHIJ")

df = pd.DataFrame(index=index, data=data, columns=cols)

# now resample to daily average
df = df.resample(rule="1D").mean()

Most of the data in this dataframe is useless and can be removed via:

df.dropna(how="all", axis=0, inplace=True)

However, this is sloppy. Is there another method to resample the dataframe that does not fill all of the data gaps with NaN (i.e. in the example above, the resulting dataframe would have only two rows)?
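To see the scale of the problem, here is a quick check rebuilding the example above (the lowercase "s" seconds alias is used here, since uppercase "S" is deprecated in recent pandas):

```python
import numpy as np
import pandas as pd

# rebuild the example: two 10-second bursts of data, 20 years apart
freq1 = pd.date_range("20000101", periods=10, freq="s")
freq2 = pd.date_range("20200101", periods=10, freq="s")
df = pd.DataFrame(
    index=np.hstack([freq1.values, freq2.values]),
    data=np.random.randint(0, 100, (20, 10)),
    columns=list("ABCDEFGHIJ"),
)

daily = df.resample(rule="1D").mean()
print(len(daily))                    # 7306 daily rows spanning 2000-2020
print(len(daily.dropna(how="all")))  # only 2 of them actually hold data
```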

Twyla answered 9/7, 2020 at 15:04

Updating my original answer with what I think is an improvement, plus updated timings.

Use groupby

There are a couple of ways you can use groupby instead of resample. In the case of daily ("1D") resampling, you can just use the date property of the DatetimeIndex:

df = df.groupby(df.index.date).mean()
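One caveat worth noting (not shown above): df.index.date returns plain datetime.date objects, so the grouped result ends up with an object-dtype index rather than a DatetimeIndex. If downstream code needs datetime semantics, pd.to_datetime converts it back — a minimal sketch:

```python
import numpy as np
import pandas as pd

# rebuild the example dataframe from the question
freq1 = pd.date_range("20000101", periods=10, freq="s")
freq2 = pd.date_range("20200101", periods=10, freq="s")
df = pd.DataFrame(
    index=np.hstack([freq1.values, freq2.values]),
    data=np.random.randint(0, 100, (20, 10)),
    columns=list("ABCDEFGHIJ"),
)

out = df.groupby(df.index.date).mean()
print(out.index.dtype)  # object: the labels are datetime.date, not datetimes

# convert back to a DatetimeIndex if downstream code expects one
out.index = pd.to_datetime(out.index)
```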

This is in fact faster than the resample for your data:

%%timeit
df.resample(rule='1D').mean().dropna()
# 2.08 ms ± 114 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

%%timeit
df.groupby(df.index.date).mean()
# 666 µs ± 15.1 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
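Both approaches should give the same numbers, just with different index types; a quick sanity check under that assumption:

```python
import numpy as np
import pandas as pd

# rebuild the example dataframe from the question
freq1 = pd.date_range("20000101", periods=10, freq="s")
freq2 = pd.date_range("20200101", periods=10, freq="s")
df = pd.DataFrame(
    index=np.hstack([freq1.values, freq2.values]),
    data=np.random.randint(0, 100, (20, 10)),
    columns=list("ABCDEFGHIJ"),
)

a = df.resample(rule="1D").mean().dropna(how="all")  # DatetimeIndex
b = df.groupby(df.index.date).mean()                 # object index of dates
assert np.allclose(a.to_numpy(), b.to_numpy())       # identical values
```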

The more general approach would be to use the floor of the timestamps to do the groupby operation:

rule = '1D'
f = df.index.floor(rule)
df.groupby(f).mean()

#                A     B     C     D     E     F     G     H     I     J
# 2000-01-01  50.5  33.5  62.7  42.4  46.7  49.2  64.0  53.3  71.0  38.0
# 2020-01-01  50.4  56.3  57.4  46.2  55.0  60.2  60.3  57.8  63.5  47.3

This will work with more irregular frequencies as well. The main snag is that, by default, the floor is calculated relative to a fixed reference date (the Unix epoch, 1970-01-01) rather than to your data, which can cause weird results (see my post):

rule = '7D'
f = df.index.floor(rule)
df.groupby(f).mean()

#                A     B     C     D     E     F     G     H     I     J
# 1999-12-30  50.5  33.5  62.7  42.4  46.7  49.2  64.0  53.3  71.0  38.0
# 2019-12-26  50.4  56.3  57.4  46.2  55.0  60.2  60.3  57.8  63.5  47.3
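Those odd start dates are not arbitrary: assuming the epoch is the reference point, both bin edges should sit a whole number of weeks after 1970-01-01, which a quick check confirms:

```python
import pandas as pd

# both "weird" bin edges above are whole multiples of 7 days from the epoch
epoch = pd.Timestamp(0)  # 1970-01-01
print((pd.Timestamp("1999-12-30") - epoch).days % 7)  # 0
print((pd.Timestamp("2019-12-26") - epoch).days % 7)  # 0
```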

The major issue is that the resampling doesn't start on the earliest timestamp in your data. However, this is fixable using the solution from the post linked above:

# custom function for flooring relative to a start date
def floor(x, freq):
    offset = x[0].ceil(freq) - x[0]
    return (x + offset).floor(freq) - offset

rule = '7D'
f = floor(df.index, rule)
df.groupby(f).mean()

#                A     B     C     D     E     F     G     H     I     J
# 2000-01-01  50.5  33.5  62.7  42.4  46.7  49.2  64.0  53.3  71.0  38.0
# 2019-12-28  50.4  56.3  57.4  46.2  55.0  60.2  60.3  57.8  63.5  47.3

# the 7-day cycle now starts from 2000-01-01
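For reuse, this pattern can be wrapped in a small helper (a sketch; resample_sparse and floor_from_start are hypothetical names, not pandas API, and mean() is hard-coded as the aggregation):

```python
import numpy as np
import pandas as pd

def floor_from_start(x, freq):
    # floor timestamps relative to the first timestamp instead of the epoch
    offset = x[0].ceil(freq) - x[0]
    return (x + offset).floor(freq) - offset

def resample_sparse(df, rule):
    # hypothetical helper: groupby-based resample that emits no empty bins
    return df.groupby(floor_from_start(df.index, rule)).mean()

# usage with the question's example data
freq1 = pd.date_range("20000101", periods=10, freq="s")
freq2 = pd.date_range("20200101", periods=10, freq="s")
df = pd.DataFrame(
    index=np.hstack([freq1.values, freq2.values]),
    data=np.random.randint(0, 100, (20, 10)),
    columns=list("ABCDEFGHIJ"),
)
print(resample_sparse(df, "7D").index.tolist())
# [Timestamp('2000-01-01 00:00:00'), Timestamp('2019-12-28 00:00:00')]
```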

Just note here that the custom floor() function is relatively slow compared to pandas.DatetimeIndex.floor(), so it is best to use the latter if you can; but both are better than the original resample (in your example):

%%timeit
df.groupby(df.index.floor('1D')).mean()
# 1.06 ms ± 6.52 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

%%timeit
df.groupby(floor(df.index, '1D')).mean()
# 1.42 ms ± 14.6 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
Vestryman answered 9/7, 2020 at 15:30 Comment(4)
While good for 1D frequencies, I really want something more flexible that allows different resample frequencies. That being said, I didn't know I could specify df.index.date in groupby, so thanks for the tip. – Twyla
@Twyla If you are still interested, see my updated answer. There is a more general approach using floor which seems to be faster. – Vestryman
In case it may help anyone: in an xarray DataArray with a time dimension, you can apply f as f = da.time.dt.floor("1D"). (It will rename your time dimension to floor, though.) – Inotropic

© 2022 - 2024 — McMap. All rights reserved.