Problem
I need to reduce the length of a DataFrame to some externally defined integer (could be two rows, 10,000 rows, etc., but will always be a reduction in overall length), but I also want to keep the resulting DataFrame representative of the original. The original DataFrame (we'll call it df) has a datetime column (utc_time) and a data value column (data_value). The datetimes are always sequential and non-repeating, though not evenly spaced (i.e., data might be "missing"). For the DataFrame in this example, the timestamps are at ten-minute intervals (when data is present).
Attempts
To accomplish this, my mind immediately went to resampling with the following logic: find the difference in seconds between the first and last timestamps, divide that by the desired final length, and that's the resampling factor. I set this up here:
import numpy as np

# Define the desired final length.
final_length = 2
# Define the first timestamp (in seconds since the epoch).
first_timestamp = df['utc_time'].min().timestamp()
# Define the last timestamp (in seconds since the epoch).
last_timestamp = df['utc_time'].max().timestamp()
# Define the difference in seconds between the first and last timestamps.
delta_t = last_timestamp - first_timestamp
# Define the resampling factor in seconds (cast to int so it formats
# cleanly in the frequency string below).
resampling_factor = int(np.ceil(delta_t / final_length))
# Set the index from the `utc_time` column so that we can resample nicely.
df.set_index('utc_time', drop=True, inplace=True)
# Do the resampling.
resamp = df.resample(f'{resampling_factor}S')
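As a quick sanity check on the arithmetic (a sketch I've added, assuming the variables above): final_length bins of this width are enough to cover the whole span, so exactly two bins are expected if binning starts at the first timestamp.

# Two bins of width `resampling_factor` seconds cover the full span,
# so `final_length` bins should suffice if the first bin starts at
# the first timestamp.
assert resampling_factor * final_length >= delta_t
print(delta_t, resampling_factor)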
To look at resamp, I simply looped and printed:

for i in resamp:
    print(i)
This yielded (with some cleanup on my part) the following:
utc_time data_value
2016-09-28 21:10:00 140.0
2016-09-28 21:20:00 250.0
2016-09-28 21:30:00 250.0
2016-09-28 21:40:00 240.0
2016-09-28 21:50:00 240.0
... ...
2018-08-06 13:00:00 240.0
2018-08-06 13:10:00 240.0
2018-08-06 13:20:00 240.0
2018-08-06 13:30:00 240.0
2018-08-06 13:40:00 230.0
[69889 rows x 1 columns]
utc_time data_value
2018-08-06 13:50:00 230.0
2018-08-06 14:00:00 230.0
2018-08-06 14:10:00 230.0
2018-08-06 14:20:00 230.0
2018-08-06 14:30:00 230.0
... ...
2020-06-14 02:50:00 280.0
2020-06-14 03:00:00 280.0
2020-06-14 03:10:00 280.0
2020-06-14 03:20:00 280.0
2020-06-14 03:30:00 280.0
[97571 rows x 1 columns]
utc_time data_value
2020-06-14 03:40:00 280.0
2020-06-14 03:50:00 280.0
2020-06-14 04:00:00 280.0
2020-06-14 04:10:00 280.0
2020-06-14 04:20:00 280.0
... ...
2020-06-15 00:10:00 280.0
2020-06-15 00:20:00 270.0
2020-06-15 00:30:00 270.0
2020-06-15 00:40:00 270.0
2020-06-15 00:50:00 280.0
[128 rows x 1 columns]
As one can see, this produced three bins rather than the two I expected.
I could do something different, like changing the way I choose the resampling factor (e.g., finding the average time between timestamps and multiplying it by (length of DataFrame / final_length) should yield a more conservative resampling factor), but that would, to my mind, just mask the underlying issue. More than anything, I'd love to understand why what's happening is happening. Which leads to...
Question
Does anyone know why this is happening, and what steps I might take to ensure we get the desired number of bins? I wonder if it's an offset issue; that is, although the first timestamp in the first bin matches the first timestamp in the DataFrame, perhaps pandas is actually starting the bin before then?
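One way to probe that suspicion (a minimal sketch, assuming the resamp object from above) is to print each bin's label, since with the default settings the label is the bin's left edge; a first label earlier than df.index.min() would mean the bins were anchored before the first timestamp.

# Print each bin's label (its left edge under the default settings)
# and the number of rows that landed in it.
for label, group in resamp:
    print(label, len(group))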
For anyone who'd like to play along at home, the test DataFrame can be found here as a .csv. To get it in as a DataFrame:
import pandas as pd

df = pd.read_csv('test.csv', parse_dates=[0])
Follow-up comment: The kind parameter also changed how the bins are defined in the first place! The docs didn't make that clear at all (or I vastly misinterpreted them). Thank you! – Humiliate