More efficient way to mean-center a subset of columns in a pandas DataFrame and retain column names
I have a dataframe with about 370 columns. I'm testing a series of hypotheses that require me to use subsets of the columns to fit a cubic regression model. I'm planning on using statsmodels to model this data.

Part of the process for polynomial regression involves mean centering variables (subtracting the mean from every case for a particular feature).

I can do this with three lines of code, but it seems inefficient given that I need to replicate the process for half a dozen hypotheses. Keep in mind that I need the data at the coefficient level from the statsmodels output, so I need to retain the column names.

Here's a peek at the data. It's the subset of columns I need for one of my hypothesis tests.

      i  we  you  shehe  they  ipron
0  0.51   0    0   0.26  0.00   1.02
1  1.24   0    0   0.00  0.00   1.66
2  0.00   0    0   0.00  0.72   1.45
3  0.00   0    0   0.00  0.00   0.53

Here is the code that mean centers and keeps the column names.

import pandas as pd
from sklearn import preprocessing

#create df of features for this hypothesis, from the full dataframe
h2 = df[['i', 'we', 'you', 'shehe', 'they', 'ipron']]

#center the variables (booleans, not strings: with_std='False' is a truthy string and would also divide by the std)
x_centered = preprocessing.scale(h2, with_mean=True, with_std=False)

#convert back into a pandas DataFrame and restore the column names
x_centered_df = pd.DataFrame(x_centered, columns=h2.columns)
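
Wrapping those three steps in a small helper would at least avoid repeating them for each hypothesis subset. A minimal sketch (center_subset is just a placeholder name, and it assumes the same scale-then-rewrap approach as above):

import pandas as pd
from sklearn import preprocessing

def center_subset(frame, cols):
    # center_subset is a hypothetical helper, not part of the original code
    subset = frame[cols]
    # mean-center only (no scaling by the standard deviation)
    centered = preprocessing.scale(subset, with_mean=True, with_std=False)
    # rewrap as a DataFrame so the column names (and index) survive
    return pd.DataFrame(centered, columns=subset.columns, index=subset.index)

# e.g. h2_centered = center_subset(df, ['i', 'we', 'you', 'shehe', 'they', 'ipron'])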

Any recommendations on how to make this more efficient / faster would be awesome!

Presidentelect answered 22/1, 2016 at 18:59 Comment(0)
df.apply(lambda x: x-x.mean())

%timeit df.apply(lambda x: x-x.mean())
1000 loops, best of 3: 2.09 ms per loop

df.subtract(df.mean())

%timeit df.subtract(df.mean())
1000 loops, best of 3: 902 µs per loop

both yielding:

        i  we  you  shehe  they  ipron
0  0.0725   0    0  0.195 -0.18 -0.145
1  0.8025   0    0 -0.065 -0.18  0.495
2 -0.4375   0    0 -0.065  0.54  0.285
3 -0.4375   0    0 -0.065 -0.18 -0.635
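
For completeness, a minimal self-contained sketch (using the sample values from the question) that reproduces the subtract approach and checks that the centered column means are zero up to floating-point error:

import numpy as np
import pandas as pd

# sample data taken from the question
h2 = pd.DataFrame({
    'i':     [0.51, 1.24, 0.00, 0.00],
    'we':    [0, 0, 0, 0],
    'you':   [0, 0, 0, 0],
    'shehe': [0.26, 0.00, 0.00, 0.00],
    'they':  [0.00, 0.00, 0.72, 0.00],
    'ipron': [1.02, 1.66, 1.45, 0.53],
})

# subtracting the column means broadcasts row-wise and keeps the index and column names
centered = h2.subtract(h2.mean())

# the centered means are zero only up to floating-point error
assert np.allclose(centered.mean(), 0.0)
print(centered)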
Avantgarde answered 22/1, 2016 at 19:27 Comment(5)
Thanks very much! The lambda function worked great. Python solutions are so straightforward ... I always assume they are going to be way more complex than they turn out to be. Thanks again! – Presidentelect
Do you know why the mean that I get out of such an operation is not zero? – Hexylresorcinol
If it is very close to zero (say 1e-15) then it's floating-point representation. If it's truly different from zero then something else is off. Try for instance: np.random.seed(42); values = np.random.randint(-100, 100, 50); np.mean(values - np.mean(values)), which yields 3.97903932026e-15. – Avantgarde
Thanks Stefan! I assumed as much, but still found it irritating and had to make sure, especially since I have to show that the mean is 0 for the course I'm currently taking. – Hexylresorcinol
You can take a look at np.isclose, which deals with these issues. – Avantgarde

I know this question is a little old, but by now scikit-learn is the fastest solution. Plus, you can condense the code into one line:

pd.DataFrame(preprocessing.scale(df, with_mean=True, with_std=False), columns=df.columns)

%timeit pd.DataFrame(preprocessing.scale(df, with_mean=True, with_std=False), columns=df.columns)
684 µs ± 30.7 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)


df.subtract(df.mean())

%timeit df.subtract(df.mean())
1.63 ms ± 107 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

The df I used for testing:

df = pd.DataFrame(np.random.randint(low=1, high=10, size=(20,5)),columns = list('abcde'))
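
For a self-contained run of the above (imports included; preprocessing.scale returns a NumPy array, hence the wrap back into a DataFrame to restore the column names):

import numpy as np
import pandas as pd
from sklearn import preprocessing

# same test frame as above
df = pd.DataFrame(np.random.randint(low=1, high=10, size=(20, 5)),
                  columns=list('abcde'))

# mean-center only; with_std=False skips the division by the standard deviation
centered = pd.DataFrame(preprocessing.scale(df, with_mean=True, with_std=False),
                        columns=df.columns)

# column means are zero up to floating-point error
assert np.allclose(centered.mean(), 0.0)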
Raylenerayless answered 15/2, 2020 at 22:52 Comment(0)
