Python Pandas: calculate rolling mean (moving average) over variable number of rows
Say I have the following dataframe

import pandas as pd
df = pd.DataFrame({ 'distance':[2.0, 3.0, 1.0, 4.0],
                    'velocity':[10.0, 20.0, 5.0, 40.0] })

gives the dataframe

   distance  velocity
0       2.0      10.0
1       3.0      20.0
2       1.0       5.0
3       4.0      40.0

How can I calculate the average of the velocity column over the rolling sum of the distance column? With the example above, create a rolling sum over the last N rows in order to get a minimum cumulative distance of 5, and then calculate the average velocity over those rows.

My target output would then be like this:

   distance  velocity    rv
0       2.0      10.0   NaN
1       3.0      20.0  15.0
2       1.0       5.0  11.7
3       4.0      40.0  22.5

where

15.0 = (10+20)/2        (2 because 3 + 2     >= 5)
11.7 = (10 + 20 + 5)/3  (3 because 1 + 3 + 2 >= 5) 
22.5 = (5 + 40)/2       (2 because 4 + 1     >= 5)

Update: in Pandas-speak, my code should find the index of the reverse cumulative distance sum back from my current record (such that it is 5 or greater), and then use that index to calculate the start of the moving average.
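That reverse-cumulative-sum idea can be sketched directly, assuming the default integer index (`rv_at` is just an illustrative helper name, not part of any library):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'distance': [2.0, 3.0, 1.0, 4.0],
                   'velocity': [10.0, 20.0, 5.0, 40.0]})

def rv_at(i, df, target=5.0):
    # cumulative distance walking backwards from row i
    csum = df['distance'].iloc[i::-1].cumsum()
    reached = csum[csum >= target]
    if reached.empty:
        return np.nan          # not enough cumulative distance yet
    start = reached.index[0]   # first row (searching backwards) that reaches the target
    return df['velocity'].loc[start:i].mean()

df['rv'] = [rv_at(i, df) for i in range(len(df))]
```

This is O(n²) in the worst case, since each row recomputes its own backwards cumulative sum, but it expresses the index-lookup idea from the update verbatim.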

Existentialism answered 24/11, 2017 at 22:53 Comment(5)
Is the distance always integral and positive?Chalk
distance and velocity are always positive but not integer. I updated the question. thanksExistentialism
This question was really hard to understand, I edited to hopefully make it more clear using @Chalk 's answer to interpret. Hopefully my edits make this more clear and of course you can further edit as you see fit.Mila
As far as the question goes, it will be hard (maybe impossible?) to vectorize. If @Chalk 's answer is fast enough, I'd just use that. If it's not, you could look to convert his function into a numpy or numba function. Converting it to a numba function is probably the best approach in terms of something that will be fast, readable, and easy to implement (and it might just be the fastest way period)Mila
Vectorization is made hard in particular by the fact that the windows may become arbitrarily large. Had distances been integral and positive, we could assume that windows would always be of size at most 5 at which point precalculation may provide a passable solution. In any case, following @JohnE's comment, I added a few examples of how numba effectively solves the problem in the answer below.Chalk
Not a particularly pandasy solution, but it sounds like you want to do something like

import numpy as np

df['rv'] = np.nan
for i in range(len(df)):
    j = i
    s = 0.0
    while j >= 0 and s < 5:
        s += df['distance'].loc[j]
        j -= 1
    if s >= 5:
        # use df.loc[i, 'rv'] rather than chained indexing
        # (df['rv'].loc[i] = ...), which triggers SettingWithCopyWarning
        # and may not write back to the frame
        df.loc[i, 'rv'] = df['velocity'][j+1:i+1].mean()

Update: Since this answer was posted, the OP has stated that they want a "valid Pandas solution (e.g. without loops)". If we take this to mean that they want something more performant than the above, then, perhaps ironically given the comment, the first optimization that comes to mind is to avoid the data frame unless needed:

l = len(df)
a = np.full(l, np.nan)  # rows that never reach the target stay NaN
d = df['distance'].values
v = df['velocity'].values
for i in range(l):
    j = i
    s = 0.0
    while j >= 0 and s < 5:
        s += d[j]
        j -= 1
    if s >= 5:
        a[i] = v[j+1:i+1].mean()
df['rv'] = a

Moreover, as suggested by @JohnE, numba quickly comes in handy for further optimization. While it won't do much for the first solution above, the second solution can be decorated with @numba.jit out of the box, with immediate benefits. Benchmarking all three solutions on

pd.DataFrame({'velocity': 50*np.random.random(10000), 'distance': 5*np.random.rand(10000)})

I get the following results:

          Method                 Benchmark
-----------------------------------------------
Original data frame based     4.65 s ± 325 ms
Pure numpy array based       80.8 ms ± 9.95 ms
Jitted numpy array based      766 µs ± 52 µs

Even the innocent-looking mean is enough to throw off numba; if we get rid of that and go instead with

@numba.jit
def numba_example():
    l = len(df)
    a = np.full(l, np.nan)  # rows that never reach the target stay NaN
    d = df['distance'].values
    v = df['velocity'].values
    for i in range(l):
        j = i
        s = 0.0
        while j >= 0 and s < 5:
            s += d[j]
            j -= 1
        if s >= 5:
            a[i] = 0.0  # reset before accumulating into it
            for k in range(j+1, i+1):
                a[i] += v[k]
            a[i] /= (i - j)
    df['rv'] = a

then the benchmark reduces to 158 µs ± 8.41 µs.

Now, if you happen to know more about the structure of df['distance'], the while loop can probably be optimized further. (For example, if the values happen to always be much lower than 5, it will be faster to cut the cumulative sum from its tail, rather than recalculating everything.)
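One way to sketch that idea: since distances are strictly positive, windows only ever shrink from the left as they slide, so a two-pointer pass can keep running distance and velocity sums instead of recomputing them, for an O(n) total (`rolling_mean_min_dist` is an illustrative name, not from the answer above):

```python
import numpy as np

def rolling_mean_min_dist(d, v, target=5.0):
    # Two-pointer sketch, assuming all distances are positive:
    # maintain the smallest window ending at i whose distance sum >= target.
    n = len(d)
    out = np.full(n, np.nan)
    left = 0
    s = 0.0   # running distance sum over d[left:i+1]
    vs = 0.0  # running velocity sum over v[left:i+1]
    for i in range(n):
        s += d[i]
        vs += v[i]
        # shrink from the left while the window would still reach the target
        while s - d[left] >= target:
            s -= d[left]
            vs -= v[left]
            left += 1
        if s >= target:
            out[i] = vs / (i - left + 1)
    return out
```

Each element enters and leaves the window at most once, and the loop body is simple enough that the same function should also jit cleanly.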

Chalk answered 24/11, 2017 at 23:14 Comment(5)
interesting - thanks. (sorry for the confusion with the 30s. I replaced them for 5s to make the data more plausible)Existentialism
Nice! Numba is pretty amazing sometimes.Mila
Are you sure mean() caused a problem for numba? Generally numpy code works fine with numba, especially standard stuff like mean. I don't think it will matter for speed, btw, just would be very surprising to me if numba couldn't handle numpy mean().Mila
Good question. I'm not going to dig into it the difference in numbers myself, but inspect_llvm might provide some hints?Chalk
The globals probably do mess up the benchmark a bit, but removing those I still see a big difference in results.Chalk
How about

df.rolling(window=3, min_periods=2).mean()

   distance   velocity
0       NaN        NaN
1  2.500000  15.000000
2  2.000000  11.666667
3  2.666667  21.666667

To combine them

df['rv'] = df.velocity.rolling(window=3, min_periods=2).mean()

It looks like something's a little off with the window shape.

Fondue answered 24/11, 2017 at 23:33 Comment(2)
I see how this works, but how would I tune the rolling_sum(distance) as a pre-aggregate for the window of the average velocity?Existentialism
df['rv'] is completely independent of df['distance'] here; that's not what they want.Chalk
