How to ignore NaN in rolling average calculation in Python

For a time series sales forecasting task I want to create a feature that represents the average sales over the last 3 days. I have a problem when I want to predict the sales for days in the future, since these data points do not have sales data (NaN values). Pandas offers rolling_mean(), but that function returns NaN whenever any data point in the window is NaN.

My data:

Date    Sales
02-01-2013  100.0
03-01-2013  200.0
04-01-2013  300.0
05-01-2013  200.0
06-01-2013  NaN

Result after using pd.rolling_mean() with window of 2:

Date    Rolling_Sales
02-01-2013  NaN
03-01-2013  150.0
04-01-2013  250.0
05-01-2013  250.0
06-01-2013  NaN

Desired result:

Date    Rolling_Sales
02-01-2013  NaN
03-01-2013  150.0
04-01-2013  250.0
05-01-2013  250.0
06-01-2013  200.0

So if a NaN is included in the window, I want to ignore it and take the average of all the other data points in the window.

Garbanzo answered 5/4, 2018 at 16:47 Comment(1)
Are you looking for df.ffill().Sales.rolling(2).mean()? - Wilma

Here is one way, adding min_periods:

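import numpy as np

# min_periods=1 lets a window with only one non-NaN value still return a mean
s = df.Sales.rolling(window=2, min_periods=1).mean()
# Reset the first row to NaN to match the desired output
s.iloc[0] = np.nan
s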
Out[1293]: 
0      NaN
1    150.0
2    250.0
3    250.0
4    200.0
Name: Sales, dtype: float64
Encomiast answered 5/4, 2018 at 16:55 Comment(3)
This does not address the problem. You're assuming the NaNs occur at the beginning of the series, but there appears to be a bug in pandas... you can't ignore/skip NaN values that occur later in the series as well. min_periods shows up everywhere as an answer to this. It IS NOT the answer. - Lockjaw
@Lockjaw so what is the answer? - Encomiast
This might not be the best answer, but what I have done, partly in response to @Lockjaw's point, is to use df.fillna(method='ffill') before using rolling. This might not work for everyone and depends on your data, but for my purposes forward filling to remove NaN values was acceptable practice. - Glyptograph
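To make the two suggestions in this thread easier to compare, here is a minimal sketch on made-up data (not the OP's) with a NaN in the middle of the series as well as at the end. It assumes a reasonably recent pandas; the variable names are only for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical series with a NaN in the middle and one at the end
s = pd.Series([100.0, np.nan, 300.0, 200.0, np.nan])

# Option 1: skip NaNs inside the window (needs at least 1 non-NaN value)
skip_nan = s.rolling(window=2, min_periods=1).mean()

# Option 2: forward fill first, then use a plain rolling mean
filled = s.ffill().rolling(window=2).mean()

print(pd.DataFrame({"sales": s, "skip_nan": skip_nan, "ffill": filled}))
```

The two options answer different questions: the first averages only the values that are actually present, while the second treats the last known value as if it had been observed again. They give different numbers as soon as a NaN sits between two different values.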

I think I understand what OP is going for here. Here is why the above answer is NOT correct. Problem description: we want to compute a rolling function (mean, median, sum, etc.) that behaves like NumPy's np.nan[function] family (np.nanmean, np.nansum, etc.) while retaining the W-1 NaNs at the start of the result, where the window is not yet long enough.

Using min_periods as in the answer above loses the NaN results at positions where the window is not yet full. However, pandas rolling does not provide a way to treat a NaN inside a full-length window as a "valid" sample. Here is the required logic:

If (current window length < desired window length) -> return nan

If (current window length == desired window length) -> return np.nan[func](window)
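For the mean case, the two rules above can be approximated in plain pandas. This is only a sketch under my own assumptions: the helper name rolling_nanmean is hypothetical, and a window that contains nothing but NaN still returns NaN here.

```python
import numpy as np
import pandas as pd

def rolling_nanmean(s, W):
    # Hypothetical helper, not part of pandas.
    # Rule 2: NaN-skipping mean over every window
    out = s.rolling(window=W, min_periods=1).mean()
    # Rule 1: the first W-1 positions have a window shorter than W
    out.iloc[:W - 1] = np.nan
    return out

# The data from the question
sales = pd.Series([100.0, 200.0, 300.0, 200.0, np.nan])
result = rolling_nanmean(sales, 2)
print(result)
```

On the question's data this gives NaN for the first row and 200.0 for the last one, which matches the desired result.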

I've seen a few questions like this across Stack Overflow and the problem is hard to describe, so a lot of people do not have the correct answer. Here is a solution using numba, with a rolling sum as an example. In my example, I also provide the ability to skip the calculation over the first block of NaN, if such a block exists. Remove A from the code to drop this behavior:

from numba import njit
import numpy as np
@njit
def rolling_nansum(x, W):
    # Setup Output Array
    out = np.full(len(x), np.nan)

    # Find the first non-nan value (first valid sample for the function)
    A = (~np.isnan(x)).argmax()

    # Compute the Rolling Function
    for i in range(A+W-1, len(x)):
        out[i] = np.nansum(x[i-W+1:i+1])
    return out

df.apply(lambda x: rolling_nansum(x, 100), raw=True, axis=0)

This was tested on a 4300x1000 element DataFrame and performs the calculation in 480 ms.

This code applies the correct logic, columnwise across the DataFrame. This is a good way to handle rolling functions on missing data without introducing lookahead bias, or having results with nan. This is a common use case for financial data.
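If numba is not available, the same logic can be sketched with NumPy alone, which is also handy for checking the numba version on small inputs. This assumes NumPy 1.20 or later for sliding_window_view; the function name rolling_nansum_numpy is made up for this example and I have not benchmarked it.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

def rolling_nansum_numpy(x, W):
    # Hypothetical NumPy-only counterpart of rolling_nansum above
    x = np.asarray(x, dtype=float)
    out = np.full(len(x), np.nan)
    if len(x) < W:
        return out  # no full window fits

    # Index of the first non-nan value (0 if everything is nan)
    A = (~np.isnan(x)).argmax()

    # sums[j] is the nan-skipping sum of x[j:j+W], i.e. the window ending at j+W-1
    sums = np.nansum(sliding_window_view(x, W), axis=1)

    # Keep only windows that start at or after the first valid sample
    out[A + W - 1:] = sums[A:]
    return out

x = np.array([np.nan, np.nan, 1.0, np.nan, 3.0, 4.0])
result = rolling_nansum_numpy(x, 2)
print(result)
```

With raw=True it can be passed to df.apply in the same way as the numba version.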

Brooking answered 1/7, 2023 at 23:48 Comment(0)
