I think I understand what the OP is going for here, and here is why the above answer is not correct.
Problem description: we want to compute a rolling function (mean, median, sum, etc.) that behaves like np.nan[function] while still returning W - 1 NaNs at the start of the result, where the window is not yet long enough. The above answer with min_periods loses those leading NaNs, because windows that are too short now produce values. However, pandas rolling has no option to treat a NaN inside a window of the correct length as a "valid" sample. Here is the required logic:
If (current window length < desired window length) -> return nan
If (current window length == desired window length) -> return np.nan[func](window)
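To make the difference concrete, here is a small sketch comparing the two pandas behaviours described above with the required logic. The series s and the window W = 3 are made-up illustration values, not data from the question.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: one NaN in the middle of the series
s = pd.Series([1.0, 2.0, np.nan, 4.0, 5.0])
W = 3

# Default rolling: min_periods equals the window, so any NaN in a window gives NaN
default = s.rolling(W).sum()

# min_periods=1: NaNs are skipped, but the leading W - 1 NaNs disappear as well
relaxed = s.rolling(W, min_periods=1).sum()

# Required logic: NaN until the window is full, nansum afterwards
vals = s.to_numpy()
wanted = np.full(len(vals), np.nan)
for i in range(W - 1, len(vals)):
    wanted[i] = np.nansum(vals[i - W + 1:i + 1])

print(default.to_numpy())  # every entry is NaN
print(relaxed.to_numpy())  # 1, 3, 3, 6, 9
print(wanted)              # nan, nan, 3, 6, 9
```

Neither pandas call gives the last row: the default loses every window that touches the NaN, and min_periods=1 fills in the first two positions.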
I've seen a few questions like this across Stack Overflow, and the problem is hard to describe, so a lot of the answers miss the point. Here is a solution using numba, with a rolling sum as an example. The example also skips the first block of NaN values, if such a block exists, so the function is not calculated over it. Remove A from the start of the range to drop this behavior:
from numba import njit
import numpy as np

@njit
def rolling_nansum(x, W):
    # Set up the output array, filled with nan
    out = np.full(len(x), np.nan)
    # Find the first non-nan value (first valid sample for the function)
    A = (~np.isnan(x)).argmax()
    # Compute the rolling function once a full window past A is available
    for i in range(A + W - 1, len(x)):
        out[i] = np.nansum(x[i - W + 1:i + 1])
    return out

# Apply column-wise to an existing DataFrame df, here with a window of 100
df.apply(lambda x: rolling_nansum(x, 100), raw=True, axis=0)
This was tested on a 4300x1000 element DataFrame and performs the calculation in 480 ms.
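To see what the leading-NaN offset A changes, here is a sketch that repeats the same loop in plain NumPy (no numba, so it runs anywhere but slower). The name rolling_nansum_py and the skip_leading_nans flag are hypothetical, added only for this illustration.

```python
import numpy as np

def rolling_nansum_py(x, W, skip_leading_nans=True):
    """Plain-NumPy copy of the loop above; the flag toggles the offset A."""
    x = np.asarray(x, dtype=float)
    out = np.full(len(x), np.nan)
    # Index of the first non-nan sample, or 0 when the offset is disabled
    A = (~np.isnan(x)).argmax() if skip_leading_nans else 0
    for i in range(A + W - 1, len(x)):
        out[i] = np.nansum(x[i - W + 1:i + 1])
    return out

# Made-up input with a block of two leading NaNs and one NaN in the middle
x = np.array([np.nan, np.nan, 1.0, 2.0, np.nan, 4.0])

with_skip = rolling_nansum_py(x, 2)
without_skip = rolling_nansum_py(x, 2, skip_leading_nans=False)

print(with_skip)     # nan, nan, nan, 3, 2, 4
print(without_skip)  # nan, 0, 1, 3, 2, 4
```

Without the offset, the windows that overlap the leading NaN block already produce values (0 and 1 here), because np.nansum treats NaN as zero. One caveat: argmax returns 0 when every entry is NaN, so an all-NaN column yields 0.0 from position W - 1 onward unless that case is handled separately.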
The rolling_nansum code above applies the correct logic column-wise across the DataFrame. This is a good way to handle rolling functions on missing data without introducing lookahead bias and without NaN spreading into every result whose window contains a missing value. This is a common use case for financial data.
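The same logic carries over to the other np.nan[function] variants mentioned in the problem description. Below is a sketch of a vectorized alternative that swaps the numba loop for NumPy's sliding_window_view (NumPy 1.20 or later). The name rolling_nanfunc and its func parameter are hypothetical, and this version has not been timed against the numba one.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

def rolling_nanfunc(x, W, func=np.nanmean):
    """Rolling np.nan[function] with NaN until a full window past the first valid sample."""
    x = np.asarray(x, dtype=float)
    out = np.full(len(x), np.nan)
    valid = ~np.isnan(x)
    # Nothing to compute if the input is shorter than W or entirely NaN
    if len(x) < W or not valid.any():
        return out
    A = valid.argmax()                   # first non-nan sample
    start = A + W - 1                    # first output index with a full window
    if start < len(x):
        windows = sliding_window_view(x, W)  # row j is x[j:j + W]
        out[start:] = func(windows[A:], axis=1)
    return out

# Made-up input: one leading NaN and one NaN in the middle
x = [np.nan, 1.0, np.nan, 3.0, 5.0]
print(rolling_nanfunc(x, 2))                  # nan, nan, 1, 3, 4
print(rolling_nanfunc(x, 2, func=np.nansum))  # nan, nan, 1, 3, 8
```

It can be applied per column the same way, e.g. df.apply(lambda x: rolling_nanfunc(x, 100), raw=True, axis=0). Note that np.nanmean and np.nanmedian return NaN with a RuntimeWarning for a window that is entirely NaN, whereas np.nansum returns 0.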