More efficient way to mean-center a subset of columns in a pandas DataFrame and retain column names
I have a dataframe with about 370 columns. I'm testing a series of hypotheses that require me to use subsets of the columns to fit a cubic regression model. I'm planning on using statsmodels to model this data.

Part of the process for polynomial regression involves mean centering variables (subtracting the mean from every case for a particular feature).

I can do this with three lines of code, but it seems inefficient given that I need to replicate the process for half a dozen hypotheses. Keep in mind that I need the data at the coefficient level from the statsmodels output, so I need to retain the column names.

Here's a peek at the data. It's the subset of columns I need for one of my hypothesis tests.

      i  we  you  shehe  they  ipron
0  0.51   0    0   0.26  0.00   1.02
1  1.24   0    0   0.00  0.00   1.66
2  0.00   0    0   0.00  0.72   1.45
3  0.00   0    0   0.00  0.00   0.53

Here is the code that mean centers and keeps the column names.

import pandas as pd
from sklearn import preprocessing

#create df of features for this hypothesis, from the full dataframe
h2 = df[['i', 'we', 'you', 'shehe', 'they', 'ipron']]

#center the variables (booleans, not strings: with_std='False' is a truthy string and would also divide by the std)
x_centered = preprocessing.scale(h2, with_mean=True, with_std=False)

#convert back into a pandas DataFrame and restore the column names
x_centered_df = pd.DataFrame(x_centered, columns=h2.columns)
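
Wrapping those three steps in a small helper would at least avoid repeating them for each hypothesis subset. A minimal sketch (center_subset is just a placeholder name, and it assumes the same scale-then-rewrap approach as above):

import pandas as pd
from sklearn import preprocessing

def center_subset(frame, cols):
    # center_subset is a hypothetical helper, not part of the original code
    subset = frame[cols]
    # mean-center only (no scaling by the standard deviation)
    centered = preprocessing.scale(subset, with_mean=True, with_std=False)
    # rewrap as a DataFrame so the column names (and index) survive
    return pd.DataFrame(centered, columns=subset.columns, index=subset.index)

# e.g. h2_centered = center_subset(df, ['i', 'we', 'you', 'shehe', 'they', 'ipron'])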

Any recommendations on how to make this more efficient / faster would be awesome!

Presidentelect answered 22/1, 2016 at 18:59 Comment(0)
df.apply(lambda x: x-x.mean())

%timeit df.apply(lambda x: x-x.mean())
1000 loops, best of 3: 2.09 ms per loop

df.subtract(df.mean())

%timeit df.subtract(df.mean())
1000 loops, best of 3: 902 µs per loop

both yielding:

        i  we  you  shehe  they  ipron
0  0.0725   0    0  0.195 -0.18 -0.145
1  0.8025   0    0 -0.065 -0.18  0.495
2 -0.4375   0    0 -0.065  0.54  0.285
3 -0.4375   0    0 -0.065 -0.18 -0.635
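
For completeness, a minimal self-contained sketch (using the sample values from the question) that reproduces the subtract approach and checks that the centered column means are zero up to floating-point error:

import numpy as np
import pandas as pd

# sample data taken from the question
h2 = pd.DataFrame({
    'i':     [0.51, 1.24, 0.00, 0.00],
    'we':    [0, 0, 0, 0],
    'you':   [0, 0, 0, 0],
    'shehe': [0.26, 0.00, 0.00, 0.00],
    'they':  [0.00, 0.00, 0.72, 0.00],
    'ipron': [1.02, 1.66, 1.45, 0.53],
})

# subtracting the column means broadcasts row-wise and keeps the index and column names
centered = h2.subtract(h2.mean())

# the centered means are zero only up to floating-point error
assert np.allclose(centered.mean(), 0.0)
print(centered)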
Avantgarde answered 22/1, 2016 at 19:27 Comment(5)
Thanks very much! The lambda function worked great. Python solutions are so straightforward ... I always assume they are going to be way more complex than they turn out to be. Thanks again! – Presidentelect
Do you know why the mean that I get out of such an operation is not zero? – Hexylresorcinol
If it is very close to zero (say 1e-15) then it's floating-point representation. If it's truly different from zero then something else is off. Try for instance: np.random.seed(42); values = np.random.randint(-100, 100, 50); np.mean(values - np.mean(values)), which yields 3.97903932026e-15. – Avantgarde
Thanks Stefan! I assumed as much, but still found it irritating and had to make sure, especially since I have to show that the mean is 0 for the course I'm currently taking. – Hexylresorcinol
You can take a look at np.isclose, which deals with these issues. – Avantgarde

I know this question is a little old, but by now scikit-learn is the fastest solution. Plus, you can condense the code into one line:

pd.DataFrame(preprocessing.scale(df, with_mean=True, with_std=False), columns=df.columns)

%timeit pd.DataFrame(preprocessing.scale(df, with_mean=True, with_std=False), columns=df.columns)
684 µs ± 30.7 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)


df.subtract(df.mean())

%timeit df.subtract(df.mean())
1.63 ms ± 107 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

The df I used for testing:

df = pd.DataFrame(np.random.randint(low=1, high=10, size=(20,5)),columns = list('abcde'))
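
For a self-contained run of the above (imports included; preprocessing.scale returns a NumPy array, hence the wrap back into a DataFrame to restore the column names):

import numpy as np
import pandas as pd
from sklearn import preprocessing

# same test frame as above
df = pd.DataFrame(np.random.randint(low=1, high=10, size=(20, 5)),
                  columns=list('abcde'))

# mean-center only; with_std=False skips the division by the standard deviation
centered = pd.DataFrame(preprocessing.scale(df, with_mean=True, with_std=False),
                        columns=df.columns)

# column means are zero up to floating-point error
assert np.allclose(centered.mean(), 0.0)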
Raylenerayless answered 15/2, 2020 at 22:52 Comment(0)
