Pandas reindex and interpolate time series efficiently (reindex drops data)
Asked Answered
B

3

9

Suppose I wish to re-index, with linear interpolation, a time series to a pre-defined index, where none of the index values are shared between old and new index. For example

# index is all precise timestamps e.g. 2018-10-08 05:23:07
series = pandas.Series(data,index) 

# I want rounded date-times
desired_index = pandas.date_range("2010-10-08",periods=10,freq="30min") 

Tutorials/API suggest the way to do this is to reindex then fill NaN values using interpolate. But, as there is no overlap of datetimes between the old and new index, reindex outputs all NaN:

# The following outputs all NaN as no date times match old to new index
series.reindex(desired_index)

I do not want to fill nearest values during reindex as that will lose precision, so I came up with the following; concatenate the reindexed series with the original before interpolating:

pandas.concat([series,series.reindex(desired_index)]).sort_index().interpolate(method="linear")

This seems very inefficient, concatenating and then sorting the two series. Is there a better way?

Blubberhead answered 8/10, 2018 at 11:31 Comment(0)
M
14

The only (simple) way I can see of doing this is to use resample to upsample to your time resolution (say 1 second), then reindex.

Get an example DataFrame:

import numpy as np
import pandas as pd

np.random.seed(2)

df = (pd.DataFrame()
 .assign(SampleTime=pd.date_range(start='2018-10-01', end='2018-10-08', freq='30T')
                    + pd.to_timedelta(np.random.randint(-5, 5, size=337), unit='s'),
         Value=np.random.randn(337)
         )
 .set_index(['SampleTime'])
)

Let's see what the data looks like:

df.head()

                        Value
SampleTime
2018-10-01 00:00:03     0.033171
2018-10-01 00:30:03     0.481966
2018-10-01 01:00:01     -0.495496

Get the desired index:

desired_index = pd.date_range('2018-10-01', periods=10, freq='30T')

Now, reindex the data with the union of the desired and existing indices, interpolate based on the time, and reindex again using only the desired index:

(df
 .reindex(df.index.union(desired_index))
 .interpolate(method='time')
 .reindex(desired_index)
)

                        Value
2018-10-01 00:00:00     NaN
2018-10-01 00:30:00     0.481218
2018-10-01 01:00:00     -0.494952
2018-10-01 01:30:00     -0.103270

As you can see, you still have an issue with the first timestamp because it's outside the range of the original index; there are number of ways to deal with this (pad, for example).

Mammon answered 8/10, 2018 at 12:2 Comment(4)
Are resample and interpolate suitably lazy, or will they compute thousands of values I don't need?Blubberhead
Ah, yes, that was the caveat I forgot to add. resample is lazy but interpolate isn't, so if you have a lot of data that could take a while.Mammon
@SideshowBob I've corrected the answer to use interpolate(method='time'), which is vastly better.Mammon
Great, thanks. I don't know whether index.union always sorts or only if necessary, so it could still be a little inefficient if not - but at least now I feel like I'm using pandas like I'm supposed to so it's the library's fault not mine ;)Blubberhead
S
0

my methods

    frequency = nyse_trading_dates.rename_axis([None]).index
    
    df = prices.rename_axis([None]).reindex(frequency)

    for d in prices.rename_axis([None]).index:
        df.loc[d] = prices.loc[d]
        
    df.interpolate(method='linear')
    

method 2

    prices = data.loc[~data.index.duplicated(keep='last')]        
    #prices = data.reset_index()

    idx1 = prices.index  
    idx1 = pd.to_datetime(idx1, errors='coerce')

    merged = idx1.union(idx2)
    s = prices.reindex(merged)
    df = s.interpolate(method='linear').dropna(axis=0, how='any')

    data=df
Saccharify answered 7/3, 2021 at 18:27 Comment(0)
S
0

What I use is a combined index. The first step is to declare the desired index. Then, combine it with the existing index. Now reindex to the combined index and fill in NaNs with interpolation, ffill or any other function. Then reindex again only to the new index.

# Create a new time series with minute frequency
new_index = pd.date_range(temperature_data.index.min(), temperature_data.index.max(), freq='min')

#Create combined index for ffill or interpolation etc. 
combined_index = new_index.union(temperature_data.index)

#Reindex to combined index
resampled_ts = temperature_data.reindex(combined_index)

# Apply interpolation to fill missing values
resampled_ts = resampled_ts.interpolate(method='linear')

# Resample the combined time series to minute frequency
resampled_ts = resampled_ts.reindex(new_index)
Spinning answered 21/2, 2024 at 10:21 Comment(0)

© 2022 - 2025 — McMap. All rights reserved.