Rolling difference in Pandas
Does anyone know an efficient function/method, such as pandas.rolling_mean, that would calculate the rolling difference of an array?

This is my closest solution:

roll_diff = pd.Series(values).diff(periods=1)

However, it only calculates single-step rolling difference. Ideally the step size would be editable (i.e. difference between current time step and n last steps).

I've also written this, but for larger arrays, it is quite slow:

def roll_diff(values,step):
    diff = []
    for i in np.arange(step, len(values)-1):
        pers_window = np.arange(i-1,i-step-1,-1)
        diff.append(np.abs(values[i] - np.mean(values[pers_window])))
    diff = np.pad(diff, (0, step+1), 'constant')
    return diff
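
The loop above takes the absolute difference between each value and the mean of the `step` values before it. A minimal vectorized sketch of the same quantity, assuming the input fits in a pandas Series, combines `rolling(...).mean()` with `shift(1)`. The function name `roll_diff_vectorized` is hypothetical, and unlike the loop it keeps results aligned with the original positions (the first `step` entries are NaN) and includes the last element:

```python
import numpy as np
import pandas as pd

def roll_diff_vectorized(values, step):
    """Absolute difference between each value and the mean of the previous `step` values."""
    s = pd.Series(values, dtype=float)
    # rolling(step).mean() at position i covers i-step+1..i; shifting by one
    # moves it to cover i-step..i-1, i.e. the `step` values before position i.
    prev_mean = s.rolling(window=step).mean().shift(1)
    return (s - prev_mean).abs()

values = np.array([1.0, 3.0, 6.0, 1.0, -5.0, 6.0, 4.0, 1.0, 6.0])
result = roll_diff_vectorized(values, step=3)
print(result)
```

Whether this matches your intent depends on how you want the edges handled; NaN entries can be dropped or filled afterwards.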
Clerissa answered 30/1, 2018 at 9:45 Comment(0)

What about:

import pandas

x = pandas.DataFrame({
    'x_1': [0, 1, 2, 3, 0, 1, 2, 500, ],},
    index=[0, 1, 2, 3, 4, 5, 6, 7])

x['x_1'].rolling(window=2).apply(lambda x: x.iloc[1] - x.iloc[0])

In general, you can replace the lambda function with your own function. Note that in this case the first item will be NaN, because the first window is incomplete.

Update

Defining the following:

n_steps = 2
def my_fun(x):
    return x.iloc[-1] - x.iloc[0]

x['x_1'].rolling(window=n_steps).apply(my_fun)

With this, you can compute the difference between the last and first values of each window of n_steps rows.
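
With `window=n_steps`, the first and last elements of each window are `n_steps - 1` positions apart, so the expression above should match the built-in `Series.diff(periods=n_steps - 1)`, which is vectorized. A small sketch checking that equivalence on the same data (the variable names `via_apply` and `via_diff` are only for illustration):

```python
import numpy as np
import pandas as pd

x = pd.DataFrame({'x_1': [0, 1, 2, 3, 0, 1, 2, 500]})
n_steps = 3

# Rolling window of n_steps rows: last element minus first element.
via_apply = x['x_1'].rolling(window=n_steps).apply(lambda w: w.iloc[-1] - w.iloc[0])

# Built-in difference between rows that are n_steps - 1 positions apart.
via_diff = x['x_1'].diff(periods=n_steps - 1)

# Compare the two results, treating NaN positions as equal.
print(np.allclose(via_apply, via_diff, equal_nan=True))
```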

Uptodate answered 30/1, 2018 at 10:28 Comment(6)
This is a good solution for small dataframes, as long as you figure out how to replace NaN. However, I would say it's not efficient, since pd.Series.apply is not vectorised but a thinly veiled loop. - Polly
@jp_data_analysis If you have big data, a better approach may be to use dask instead of pandas. As for the NaNs, they appear in order to keep the size of the array consistent; if you do not need them, you can cut them away. For differential operations, the common approaches are either adding NaNs or cutting the array. - Uptodate
@Pieruluigi True re: dask. When I say big data, I also mean repeated calculations on small data (where dask isn't as useful). For pandas performance, the hierarchy is generally: built-in pandas methods, numpy functions, series.apply, df.apply, and finally df.iterrows. - Polly
@jp_data_analysis Agreed. If the goal is to be as fast as possible, implementing the solution in Cython could also be an option, but before going in that direction it would be better to know the real use case :-) - Uptodate
Is it x[-1] - x[0] or x[1] - x[0] for the difference? I believe x[-1] works correctly; x[1] will take the next day's difference rather than the difference across the window, right? - Reynaud
Another point to note is that rolling(window=n) spans n-1 steps, whereas pct_change(periods=n) spans n. - Reynaud

You can do the same thing as in https://mcmap.net/q/470995/-making-a-custom-window-type-for-pandas-rolling-mean if you work directly on the underlying numpy array:

import numpy as np
diff_kernel = np.array([1,-1])
np.convolve(rs,diff_kernel ,'same')

where rs is your pandas series
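
The kernel `[1, -1]` gives a one-step difference. To span n steps, one option is to pad the kernel with zeros between the `1` and the `-1`. The sketch below assumes `mode='valid'`, which drops the edge positions where the kernel does not fully overlap the data (the helper name `convolve_diff` is hypothetical):

```python
import numpy as np

def convolve_diff(arr, n):
    """n-step difference arr[i] - arr[i - n] computed via convolution."""
    kernel = np.zeros(n + 1)
    kernel[0] = 1.0    # weight on the current value
    kernel[-1] = -1.0  # weight on the value n steps earlier
    # 'valid' keeps only full overlaps, so the result has len(arr) - n entries.
    return np.convolve(arr, kernel, mode='valid')

rs = np.array([1, 3, 6, 1, -5, 6, 4, 1, 6])
print(convolve_diff(rs, 4))
```

For a pandas series, pass `rs.to_numpy()` and decide separately how to label or pad the first n positions.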

Niemeyer answered 30/1, 2018 at 10:38 Comment(0)

This should work:

import numpy as np

x = np.array([1, 3, 6, 1, -5, 6, 4, 1, 6])

def running_diff(arr, N):
    return np.array([arr[i] - arr[i-N] for i in range(N, len(arr))])

running_diff(x, 4)  # array([-6,  3, -2,  0, 11])

For a given pd.Series, you will have to define what you want for the first few items. The below example just returns the initial series values.

s_roll_diff = np.hstack((s.values[:4], running_diff(s.values, 4)))

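This works because you can assign a np.array directly to a pd.DataFrame column. Use bracket notation to create the new column, since attribute assignment does not add a column, e.g. for a column s: df['s_roll_diff'] = np.hstack((df['s'].values[:4], running_diff(df['s'].values, 4)))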
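
The list comprehension in `running_diff` can also be written with array slicing, which avoids the Python-level loop. A sketch assuming a 1-D NumPy array and `0 < N < len(arr)`; the name `running_diff_sliced` is hypothetical:

```python
import numpy as np

def running_diff_sliced(arr, N):
    """Same result as running_diff, using slices instead of a loop."""
    arr = np.asarray(arr)
    # arr[N:] holds arr[i] for i >= N; arr[:-N] holds the matching arr[i - N].
    return arr[N:] - arr[:-N]

x = np.array([1, 3, 6, 1, -5, 6, 4, 1, 6])
print(running_diff_sliced(x, 4))  # [-6  3 -2  0 11]
```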

Polly answered 30/1, 2018 at 10:2 Comment(0)

If you get KeyError: 0 (each window keeps its original index labels, so x[0] is a label lookup), index by position with iloc:

import pandas

x = pandas.DataFrame({
    'x_1': [0, 1, 2, 3, 0, 1, 2, 500, ],},
    index=[0, 1, 2, 3, 4, 5, 6, 7])

x['x_1'].rolling(window=2).apply(lambda x: x.iloc[1] - x.iloc[0])
Participle answered 19/2, 2020 at 12:36 Comment(0)

Applying numpy.diff (this only works for window=2, because np.diff must reduce each window to a single value):

import pandas as pd
import numpy as np

x = pd.DataFrame({
    'x_1': [0, 1, 2, 3, 0, 1, 2, 500, ],}
)

print(x)

   x_1
0    0
1    1
2    2
3    3
4    0
5    1
6    2
7  500

print(x['x_1'].rolling(window=2).apply(np.diff))

0      NaN
1      1.0
2      1.0
3      1.0
4     -3.0
5      1.0
6      1.0
7    498.0
Name: x_1, dtype: float64
Alanson answered 13/7, 2021 at 19:47 Comment(0)

If you have unevenly-spaced intervals, or temporal gaps in your data, and you want to use a rolling window of time frequencies, rather than number of periods, you can easily end up in a situation where x.iloc[-1] - x.iloc[0] doesn't return the result you expect. Pandas can construct windows with exactly 1 point, so x.iloc[-1] == x.iloc[0] and the diff is always 0.

Sometimes this is the desired outcome, but other times you might want to use the last-known value from before the start of each window.

A general solution (perhaps not so efficient) is to first artificially construct an evenly-spaced series, interpolate or fill data as needed (e.g. using Series.ffill), and then use the .rolling() techniques described in other answers.

import pandas as pd

# Data with temporal gaps
y = pd.Series(..., index=pd.DatetimeIndex(...))

# Your desired frequency
freq = '1M'

# Construct a new Index with this frequency, using your data ranges
idx_artificial = pd.date_range(y.index.min(), y.index.max(), freq=freq)

# Artificially expand the data to the evenly-spaced index
# New data points will be inserted with null/NaN values
y_artificial = y.reindex(idx_artificial)

# Fill the empty values with last-known value
# This part will vary depending on your needs
y_artificial.ffill(inplace=True)

# Now compute the diffs, using the forward-filled artificially-spaced data
# (window is a number of evenly spaced periods; 2 here is only an example)
y_diff = y_artificial.rolling(window=2).apply(lambda x: x.iat[-1] - x.iat[0])
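
To make the effect concrete, here is a small self-contained sketch with made-up data and a daily grid (both are assumptions for illustration, not part of the recipe above). It first shows the single-point windows that yield a diff of 0, then the forward-filled version that uses the last-known value:

```python
import pandas as pd

# Hypothetical data: two consecutive days, then a gap of more than a week.
y = pd.Series(
    [1.0, 3.0, 10.0],
    index=pd.to_datetime(['2023-01-01', '2023-01-02', '2023-01-10']),
)

# Time-based window of two days: the first and last windows hold one point each.
gapped = y.rolling('2D').apply(lambda w: w.iloc[-1] - w.iloc[0])
print(gapped.tolist())  # [0.0, 2.0, 0.0]

# Forward-fill onto a daily grid so each window also sees the last-known value.
daily = y.reindex(pd.date_range(y.index.min(), y.index.max(), freq='D')).ffill()
filled = daily.rolling(window=2).apply(lambda w: w.iloc[-1] - w.iloc[0])
print(filled.loc[y.index].tolist())  # [nan, 2.0, 7.0]
```

After the fill, the value at 2023-01-10 is compared against the last-known value (3.0) instead of against itself.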

And here are some helper functions to implement the above, for your copy-paste pleasure (warning: lightly-tested code written by a complete stranger, use with caution):

def date_range_from_index(index, freq=None, start=None, end=None, **kwargs):
    if start is None:
        start = index.min()
    if end is None:
        end = index.max()
    if freq is None:
        try:
            freq = index.freq
        except AttributeError:
            freq = None
        if freq is None:
            raise ValueError('Frequency not provided and input has no set frequency.')
    return pd.date_range(start, end, freq=freq, **kwargs)

def fill_dtindex(y, freq=None, start=None, end=None, fill=None):
    new_index = date_range_from_index(y.index, freq=freq, start=start, end=end)
    y = y.reindex(new_index)
    if fill is not None:
        # fillna(method=...) is deprecated in recent pandas; call ffill/bfill directly
        if fill == 'ffill':
            y = y.ffill()
        elif fill == 'bfill':
            y = y.bfill()
        else:
            y = y.fillna(fill)
    return y
Ousley answered 22/2, 2023 at 3:25 Comment(0)
