Pandas Rolling Apply custom

I have been following a similar answer here, but I have some questions about using sklearn with rolling apply. I am trying to create z-scores and do PCA with rolling apply, but I keep getting the error 'only length-1 arrays can be converted to Python scalars'.

Following the previous example I create a dataframe

from sklearn.preprocessing import StandardScaler
import pandas as pd
import numpy as np
sc=StandardScaler() 
tmp=pd.DataFrame(np.random.randn(2000,2)/10000,index=pd.date_range('2001-01-01',periods=2000),columns=['A','B'])

If I use the rolling command:

 tmp.rolling(window=5,center=False).apply(lambda x: sc.fit_transform(x))
 TypeError: only length-1 arrays can be converted to Python scalars

I get this error. I can however create functions with mean and standard deviations with no problem.

def test(df):
    return np.mean(df)
tmp.rolling(window=5,center=False).apply(lambda x: test(x))

I believe the error occurs when I try to subtract the mean from the current values for the z-score.

def test2(df):
    return df-np.mean(df)
tmp.rolling(window=5,center=False).apply(lambda x: test2(x))
only length-1 arrays can be converted to Python scalars

How can I create custom rolling functions with sklearn to first standardize and then run PCA?

EDIT: I realize my question was not exactly clear so I shall try again. I want to standardize my values and then run PCA to get the amount of variance explained by each factor. Doing this without rolling is fairly straightforward.

from sklearn import decomposition
testing=sc.fit_transform(tmp)
pca=decomposition.PCA() #run pca
pca.fit(testing) 
pca.explained_variance_ratio_
array([ 0.50967441,  0.49032559])

I cannot use this same procedure when rolling. Using the rolling zscore function from @piRSquared gives the zscores. It seems that PCA from sklearn is incompatible with the rolling apply custom function. (In fact I think this is the case with most sklearn modules.) I am just trying to get the explained variance which is a one dimensional item, but this code below returns a bunch of NaNs.

def test3(df):
    pca.fit(df)
    return pca.explained_variance_ratio_
tmp.rolling(window=5,center=False).apply(lambda x: test3(x))

I can also write my own explained variance function, but this does not work either.

def test4(df):
    cov_mat=np.cov(df.T) #need covariance of features, not observations
    eigen_vals,eigen_vecs=np.linalg.eig(cov_mat)
    tot=sum(eigen_vals)
    var_exp=[(i/tot) for i in sorted(eigen_vals,reverse=True)]
    return var_exp
tmp.rolling(window=5,center=False).apply(lambda x: test4(x))

I get this error: 0-dimensional array given. Array must be at least two-dimensional.

To recap, I would like to run rolling z-scores and then rolling pca outputting the explained variance at each roll. I have the rolling z-scores down but not explained variance.

Hooky answered 4/12, 2016 at 1:51 Comment(1)
What do you expect the output to be? A pandas rolling function is supposed to produce a single scalar value from a chunk of input. If you want to do more complex operations on chunks you'll have to "roll your own roll". Cockle
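One way to "roll your own roll" for the PCA part is to loop over the windows yourself, so that each fit sees a 2-D block of rows by columns. `rolling().apply()` works on one column at a time and hands the function a 1-D window, so PCA never sees both columns together. The sketch below is only an illustration: the helper name `rolling_explained_variance` and the `PC1`, `PC2` column labels are made up here, and it assumes the window has at least as many rows as the frame has columns.

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

def rolling_explained_variance(df, window):
    """Standardize and fit PCA on each trailing window (hypothetical helper)."""
    rows = []
    for end in range(window, len(df) + 1):
        chunk = df.iloc[end - window:end]            # 2-D block: window rows x all columns
        scaled = StandardScaler().fit_transform(chunk)
        rows.append(PCA().fit(scaled).explained_variance_ratio_)
    cols = [f"PC{i + 1}" for i in range(df.shape[1])]
    # Label each result with the last timestamp of its window, as rolling() does
    return pd.DataFrame(rows, index=df.index[window - 1:], columns=cols)

rng = np.random.default_rng(0)
tmp = pd.DataFrame(rng.standard_normal((200, 2)) / 10000,
                   index=pd.date_range('2001-01-01', periods=200),
                   columns=['A', 'B'])
evr = rolling_explained_variance(tmp, window=5)
```

Each row of `evr` holds the explained variance ratios for the window ending on that date. The loop is slower than a built-in rolling method, but it returns one value per component instead of a single scalar.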

As @BrenBarn commented, the rolling function needs to reduce a vector to a single number. The following is equivalent to what you were trying to do and helps highlight the problem.

zscore = lambda x: (x - x.mean()) / x.std()
tmp.rolling(5).apply(zscore)
TypeError: only length-1 arrays can be converted to Python scalars

In the zscore function, x.mean() reduces, x.std() reduces, but x is an array. Thus the entire thing is an array.


The way around this is to perform the roll on the parts of the z-score calculation that require it, and not on the parts that cause the problem.

(tmp - tmp.rolling(5).mean()) / tmp.rolling(5).std()
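As a sanity check, the expression above can be compared against one window computed by hand. This is a small sketch on made-up random data; the row position 9 is an arbitrary choice.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
tmp = pd.DataFrame(rng.standard_normal((20, 2)),
                   index=pd.date_range('2001-01-01', periods=20),
                   columns=['A', 'B'])

z = (tmp - tmp.rolling(5).mean()) / tmp.rolling(5).std()

# Recompute one entry by hand: the window ending at row 9 of column 'A'
window = tmp['A'].iloc[5:10].to_numpy()
by_hand = (window[-1] - window.mean()) / window.std(ddof=1)  # rolling std uses ddof=1
```

`z['A'].iloc[9]` and `by_hand` agree, and the first four rows of `z` are NaN because those windows are incomplete.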

Szabo answered 4/12, 2016 at 6:31 Comment(1)
Thanks for the z-score part. I tried to do something similar for the PCA section to no avail. Does the lambda mess up the PCA because I am doing it for many lines and not just one? Hooky

Since x in the lambda function represents a (rolling) Series/ndarray, the lambda can be written like this (where x[-1] refers to the current rolling data point):

zscore = lambda x: (x[-1] - x.mean()) / x.std(ddof=1)

Then it is OK to call:

tmp.rolling(5).apply(zscore)

Also note that the degrees of freedom (ddof) default to 1 in tmp.rolling(5).std(). To get the same results as @piRSquared's answer, one has to specify ddof=1 for x.std(), which defaults to 0 when x is a NumPy array. It took quite a while to figure this out!
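A sketch of the ddof point, on made-up random data. It passes `raw=True` so that each window arrives as a NumPy array and `x[-1]` is positional; in recent pandas versions `apply` passes a Series by default, where `x[-1]` is a label lookup instead.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
tmp = pd.DataFrame(rng.standard_normal((50, 2)),
                   index=pd.date_range('2001-01-01', periods=50),
                   columns=['A', 'B'])

zscore = lambda x: (x[-1] - x.mean()) / x.std(ddof=1)

# raw=True passes each window as a NumPy array, so x[-1] is the last element
via_apply = tmp.rolling(5).apply(zscore, raw=True)
vectorized = (tmp - tmp.rolling(5).mean()) / tmp.rolling(5).std()

# With NumPy's default ddof=0 the z-scores come out larger by a constant factor
zscore_ddof0 = lambda x: (x[-1] - x.mean()) / x.std()
via_apply_ddof0 = tmp.rolling(5).apply(zscore_ddof0, raw=True)
```

`via_apply` matches `vectorized`, while `via_apply_ddof0` differs from both by a factor of sqrt(5/4) for a window of 5.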

Enteron answered 5/6, 2020 at 1:32 Comment(2)
Hi Jerry, I get a KeyError on x[-1] when trying your answer. x is of class pandas.core.series.Series. Using x.values[-1] solved the problem for me. Lungfish
Is this still true if the parameter center=True is given to the rolling function? Transfusion
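On the center=True question: with a centered window, the value that carries the label is the middle of the window, not its last element, so `x[-1]` would score a later point. A minimal sketch, assuming an odd window size and `raw=True`; the name `zscore_centered` is made up for this example.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
s = pd.Series(rng.standard_normal(30),
              index=pd.date_range('2001-01-01', periods=30))

# For an odd window with center=True, the labelled point is the middle element
zscore_centered = lambda x: (x[len(x) // 2] - x.mean()) / x.std(ddof=1)
via_apply = s.rolling(5, center=True).apply(zscore_centered, raw=True)

roll = s.rolling(5, center=True)
vectorized = (s - roll.mean()) / roll.std()
```

Both versions agree, and the first two and last two entries are NaN because the centered window is incomplete there.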