Fast EMA calculation on large dataset with irregular time intervals

I have data that has 800,000+ rows. I want to take an Exponential Moving Average (EMA) of one of the columns. The times are not evenly sampled and I want to decay the EMA on each update (row). The code I have is this:

window = 5  # Decay window (same units as dt).
result = numpy.empty(len(series))
result[0] = series['midpoint'].iloc[0]
for i in range(1, len(series)):
    dt = series['datetime'][i] - series['datetime'][i - 1]
    decay = 1 - numpy.exp(-dt / window)
    result[i] = (1 - decay) * result[i - 1] + decay * series['midpoint'].iloc[i]
return pandas.Series(result, index=series.index)

The problem is that for 800,000 rows this is very slow. Is there any way to optimize this using other features of numpy? I can't vectorize it because result[i] depends on result[i - 1].

sample data here:

Timestamp             Midpoint
1559655000001096130    2769.125
1559655000001162260    2769.127
1559655000001171688    2769.154
1559655000001408734    2769.138
1559655000001424200    2769.123
1559655000001433128    2769.110
1559655000001541560    2769.125
1559655000001640406    2769.125
1559655000001658436    2769.127
1559655000001755924    2769.129
1559655000001793266    2769.125
1559655000001878688    2769.143
1559655000002061024    2769.125
Mendy answered 9/7, 2019 at 16:19 Comment(6)
Why don't you just use the ewm function in pandas? pandas.pydata.org/pandas-docs/stable/reference/api/… – Matsuyama
the alpha is not the same across all rows. – Mendy
Can you please provide some sample data, e.g. 5-10 rows? – Matsuyama
added to original post – Mendy
At the very least you can vectorize calculating decays as a list... – Hyperploid
Is this financial data? That is, are there natural gaps in the data that should be ignored? For example, one would have a large dt value on Monday morning from the preceding Friday. – Matsuyama
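
For reference, newer pandas versions (1.1+) do let ewm handle unevenly spaced observations through a times argument, although the decay there is 0.5 ** (dt / halflife) rather than 1 - exp(-dt / window), so it is close to, but not identical to, the recurrence in the question. A minimal sketch, assuming the integer timestamps are nanoseconds and pandas >= 1.1 (only .mean() supports times):

import pandas as pd

# Sketch only: `series` is assumed to have the question's 'datetime' (nanosecond
# integers) and 'midpoint' columns; requires pandas >= 1.1.
times = pd.to_datetime(series['datetime'])  # nanosecond ints -> datetimes
ema = series['midpoint'].ewm(halflife='5 days', times=times).mean()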

How about something like the following, which takes me 0.34 seconds to run on a series of irregularly spaced data with 900k rows? I am assuming the window of 5 implies a 5-day span.

First, let's create some sample data.

import numpy as np
import pandas as pd

# Create sample data for a price stream of 2.6m price observations sampled 1 second apart.
seconds_per_day = 60 * 60 * 24  # 60 seconds / minute * 60 minutes / hour * 24 hours / day
starting_value = 100
annualized_vol = .3
sampling_percentage = .35  # 35%
start_date = '2018-12-01'
end_date = '2018-12-31'

np.random.seed(0)
idx = pd.date_range(start=start_date, end=end_date, freq='s')  # One second intervals.
periodic_vol = annualized_vol * (1/ 252 / seconds_per_day) ** 0.5
periodic_returns = np.random.randn(len(idx)) * periodic_vol
cumulative_indexed_return = (1 + periodic_returns).cumprod() * starting_value
index_level = pd.Series(cumulative_indexed_return, index=idx)

# Sample 35% of the simulated prices to create a time series of 907k rows with irregular time intervals.
s = index_level.sample(frac=sampling_percentage).sort_index()

Now let's create a generator function that carries the latest value of the exponentially weighted time series forward. It can run about 4x faster if you install numba, import it, and add the single decorator line @jit(nopython=True) above the function definition.

from numba import jit  # Optional, see below.

@jit(nopython=True)  # Optional, see below.
def ewma_generator(vals, decay_vals):
    result = vals[0]
    yield result
    for val, decay in zip(vals[1:], decay_vals[1:]):
        result = result * (1 - decay) + val * decay
        yield result

Now let's run this generator on the irregularly spaced series s. For this sample with 900k rows, the following code takes 1.2 seconds to run. I can further cut the execution time to 0.34 seconds by optionally using the just-in-time compiler from numba; you first need to install that package, e.g. conda install numba. Note that I use a list comprehension to populate the ewma values from the generator, and then assign these values back to the original series after first converting it to a dataframe.

# Assumes time series data is now named `s`.
window = 5  # Span of 5 days?
dt = pd.Series(s.index).diff().dt.total_seconds().div(seconds_per_day)  # Measured in days.
decay = (1 - (dt / -window).apply(np.exp))
g = ewma_generator(s.values, decay.values)
result = s.to_frame('midpoint').assign(
    ewma=pd.Series([next(g) for _ in range(len(s))], index=s.index))

>>> result.tail()
                       midpoint        ewma
2018-12-30 23:59:45  103.894471  105.546004
2018-12-30 23:59:49  103.914077  105.545929
2018-12-30 23:59:50  103.901910  105.545910
2018-12-30 23:59:53  103.913476  105.545853
2018-12-31 00:00:00  103.910422  105.545720

>>> result.shape
(907200, 2)

To make sure the numbers follow our intuition, let's visualize the result taking hourly samples. This looks good to me.

obs_per_day = 24  # 24 hourly observations per day.
step = int(seconds_per_day / obs_per_day)
result.iloc[::step, :].plot()

[Plot: downsampled midpoint and ewma series over December 2018]

Matsuyama answered 10/7, 2019 at 0:16 Comment(0)

A slight improvement may be obtained by iterating on the underlying numpy arrays instead of on pandas DataFrames and Series:

result = numpy.empty(len(series))
result[0] = series['midpoint'].iloc[0]
window = 5
serdt = series['datetime'].values
sermp = series['midpoint'].values
for i in range(1, len(series)):
    dt = serdt[i] - serdt[i - 1]
    decay = 1 - numpy.exp(-dt / window)
    result[i] = (1 - decay) * result[i - 1] + decay * sermp[i]
return pandas.Series(result, index=series.index)

With your sample data it is about 6 times faster than the original method.

Maite answered 9/7, 2019 at 17:52 Comment(1)
How would I write the loop in C++ and call it in Python? I know the Boost library can be used but is there a recommended way? – Mendy

You can actually vectorize it. Here is a vectorized version using torch; you can swap torch for numpy at any time by replacing "torch" with "np":

Let us consider some time-indexed values:

values = [22, 23, 21, 22, 24, 25, 26, 27, 28, 29, 30]
dates = ['2024-01-01', '2024-01-03', '2024-01-06', '2024-01-08', '2024-01-11',
         '2024-01-15', '2024-01-17', '2024-01-20', '2024-01-25', '2024-01-30', '2024-02-04']

We build our tensors and convert the dates to millisecond timestamps:

import torch
import pandas as pd
t_values = torch.tensor(values, dtype=torch.float64)
t_dates = torch.tensor(pd.to_datetime(dates).astype('int64') / 10**6, dtype=torch.int64)

Define a time window to build the decay factor:

time_window = 1000 * 3600 * 24 * 10  # 10 days, in milliseconds
alpha = 1 - torch.exp(-t_dates.diff() / time_window)
alpha = torch.cat([torch.ones(1), alpha])
beta = 1 - alpha

Note that I am also considering a more generic framework where the decay factor is no longer a constant but a time series itself, depending on the timestamps series. The result for the EMA of the last date is given by:

# Flip, take the cumulative product, then flip back to get the right order.
beta_cumprod = beta.flip(0).cumprod(0).flip(0)
beta_cumprod = torch.cat([beta_cumprod[1:], torch.ones(1)])
result = (t_values * beta_cumprod * alpha).cumsum(0)[-1]

The result here is:

 tensor(28.2698, dtype=torch.float64)

You can push the vectorization even further and compute the EMA for each date.

This involves some matrix operations and rotations in space. Everything boils down to the construction of the triangular beta matrix, which we will denote as "beta_m". alpha and t_values remain the same.

Here is how you can do it:

numel = t_values.numel()
lower_ones = torch.tril(torch.ones((numel, numel)))
beta_m = torch.tile(beta, (numel, 1))
beta_m = (beta_m * lower_ones).flip(1)

Here we fill all the 0 values in beta_m so that we can compute a cumprod. The values affected by this change are above the diagonal, so it does not really matter (apart from the cumprod itself), since we operate below the diagonal when multiplying by "lower_ones":

beta_m[beta_m == 0] = 1
beta_m = beta_m.cumprod(1).flip(1)  # flip back

The flips induced an asymmetry that needs to be corrected by an inverse rotation on the second dimension. We leverage torch/np.roll for that:

beta_m = torch.roll(beta_m, shifts=-1, dims=1)
beta_m.fill_diagonal_(1)
result = (torch.tile(t_values * alpha, (numel, 1)) * lower_ones * beta_m).cumsum(1)[:, -1]

Which gives:

tensor([22.0000, 22.1813, 21.8751, 21.8977, 22.4426, 23.2857, 23.7777, 24.6129,
        25.9456, 27.1474, 28.2698], dtype=torch.float64)
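
As a sanity check, the original scalar recurrence from the question reproduces these numbers on the same toy data (a minimal sketch, reusing values, dates and the 10-day window from above):

import math
import pandas as pd

# Reference loop: the per-row decay recurrence from the question, applied to the
# same toy data (timestamps in milliseconds, 10-day window).
ts_ms = pd.to_datetime(dates).astype('int64') // 10**6
window_ms = 1000 * 3600 * 24 * 10
emas = [values[0]]
for prev_t, cur_t, val in zip(ts_ms[:-1], ts_ms[1:], values[1:]):
    decay = 1 - math.exp(-(cur_t - prev_t) / window_ms)
    emas.append((1 - decay) * emas[-1] + decay * val)
print(emas[-1])  # ~28.2698, matching the last element of the vectorized result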

If you have a lot of values, I advise you to run this on a GPU device: just pass device='cuda' every time you build a tensor.
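
For instance, a minimal sketch of building the tensors directly on the GPU when one is available:

# Pick the GPU if available, otherwise fall back to the CPU.
device = 'cuda' if torch.cuda.is_available() else 'cpu'
t_values = torch.tensor(values, dtype=torch.float64, device=device)
t_dates = torch.tensor(pd.to_datetime(dates).astype('int64') / 10**6, dtype=torch.int64, device=device)

Note that the helper tensors created later (torch.ones, torch.tril) need the same device= argument, otherwise torch.cat and the element-wise products will complain about mixed devices.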

Thaddeus answered 7/8, 2024 at 11:07 Comment(0)
