Apply StandardScaler to parts of a data set [duplicate]
C

6

48

I want to use sklearn's StandardScaler. Is it possible to apply it to some feature columns but not others?

For instance, say my data is:

data = pd.DataFrame({'Name' : [3, 4,6], 'Age' : [18, 92,98], 'Weight' : [68, 59,49]})

   Age  Name  Weight
0   18     3      68
1   92     4      59
2   98     6      49


col_names = ['Name', 'Age', 'Weight']
features = data[col_names]

I fit and transform the data

scaler = StandardScaler().fit(features.values)
features = scaler.transform(features.values)
scaled_features = pd.DataFrame(features, columns = col_names)

       Name       Age    Weight
0 -1.069045 -1.411004  1.202703
1 -0.267261  0.623041  0.042954
2  1.336306  0.787964 -1.245657

But of course the names are not really integers but strings and I don't want to standardize them. How can I apply the fit and transform methods only on the columns Age and Weight?

Cafard answered 17/7, 2016 at 11:47 Comment(1)
I would like to suggest a better solution: the accepted answer does not preserve column names and is therefore poor. Instead, this one-liner should be used: data[['Age', 'Weight']] = StandardScaler().fit_transform(data[['Age', 'Weight']]) –Ferminafermion
W
62

scikit-learn v0.20 introduced ColumnTransformer, which applies transformers to a specified set of columns of an array or pandas DataFrame.

import pandas as pd
data = pd.DataFrame({'Name' : [3, 4,6], 'Age' : [18, 92,98], 'Weight' : [68, 59,49]})

col_names = ['Name', 'Age', 'Weight']
features = data[col_names]

from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler

ct = ColumnTransformer([
        ('somename', StandardScaler(), ['Age', 'Weight'])
    ], remainder='passthrough')

ct.fit_transform(features)

NB: Like Pipeline, it also has a shorthand version, make_column_transformer, which doesn't require naming the transformers.

Output

array([[-1.41100443,  1.20270298,  3.        ],
       [ 0.62304092,  0.04295368,  4.        ],
       [ 0.78796352, -1.24565666,  6.        ]])
Whitney answered 23/1, 2019 at 8:21 Comment(6)
This is now the best answer (doesn't require you to copy a data frame). –Snuggery
Nice answer! How could I preserve the column names if I did this with a pandas dataframe? Is there a way without having to rename all columns at the end? –Pharos
This is what I was looking for: the best answer, and faster, although using apply is also an alternative. –Vassar
The accepted answer does not preserve column names and is therefore poor. Instead use this one-liner: data[['Age', 'Weight']] = StandardScaler().fit_transform(data[['Age', 'Weight']]) –Ferminafermion
Either column names or column order needs to be preserved, otherwise it's very cumbersome to use. Right now, the passthrough columns are appended to the end and their names are removed, so it's hard to deal with the resulting object. –Erdrich
To preserve column names and order see answers to this question –Celebration
B
46

Update:

Currently the best way to handle this is to use ColumnTransformer as explained here.


First create a copy of your dataframe:

scaled_features = data.copy()

Don't include the Name column in the transformation:

col_names = ['Age', 'Weight']
features = scaled_features[col_names]
scaler = StandardScaler().fit(features.values)
features = scaler.transform(features.values)

Now, don't create a new dataframe but assign the result to those two columns:

scaled_features[col_names] = features
print(scaled_features)


        Age  Name    Weight
0 -1.411004     3  1.202703
1  0.623041     4  0.042954
2  0.787964     6 -1.245657
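To get the original values back later, inverse_transform has to be given only the columns the scaler was fitted on, with a 2-D shape of (n_rows, 2). The following is a rough sketch for a single row; the names row and restored are made up for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

data = pd.DataFrame({'Name': [3, 4, 6], 'Age': [18, 92, 98], 'Weight': [68, 59, 49]})
col_names = ['Age', 'Weight']

scaled_features = data.copy()
scaler = StandardScaler().fit(data[col_names].values)
scaled_features[col_names] = scaler.transform(data[col_names].values)

# Double brackets keep the selection a one-row DataFrame instead of a Series
row = scaled_features.iloc[[1]]

# Undo the scaling on the two fitted columns only; Name is left untouched
restored = row.copy()
restored[col_names] = scaler.inverse_transform(row[col_names].values)
print(restored)
```

Passing the whole row (three values) is what triggers the broadcasting error, because the scaler only knows about two features.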
Bireme answered 17/7, 2016 at 12:3 Comment(5)
It works, but I am unable to use the inverse_transform function to recover the initial values with this method. With 'test = scaled_features.iloc[1,:]' and 'test_inverse = scaler.inverse_transform(test)' I get the error: ValueError: operands could not be broadcast together with shapes (3,) (2,) (3,) –Cafard
scaler.inverse_transform(scaled_features[col_names].values) works for me. –Bireme
I was trying to test the inverse_transform function with the first row. Yes, it works for me too, but I'm losing the column names. I could insert them again if I (re)convert the whole dataframe. But what if I want to inverse_transform only the first row? –Cafard
Excuse me if I haven't been clear, but when I mention the column name I mean the column containing the names (the 2nd column of the dataframe, the one that I don't want to scale), not the names of the columns. –Cafard
Yes (not necessarily the first row, but a new row with the same structure). –Cafard
V
8

Late to the party, but here's my preferred solution:

import pandas as pd
from sklearn.preprocessing import StandardScaler

#load data
data = pd.DataFrame({'Name' : [3, 4,6], 'Age' : [18, 92,98], 'Weight' : [68, 59,49]})

#list for cols to scale
cols_to_scale = ['Age','Weight']

#create and fit scaler
scaler = StandardScaler()
scaler.fit(data[cols_to_scale])

#scale selected data
data[cols_to_scale] = scaler.transform(data[cols_to_scale])
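One reason to keep fit and transform as separate steps is that the fitted scaler can then be reused on rows that arrive later, with the same means and variances. A small sketch under that assumption; new_rows is a hypothetical data frame with the same structure as data.

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

data = pd.DataFrame({'Name': [3, 4, 6], 'Age': [18, 92, 98], 'Weight': [68, 59, 49]})
cols_to_scale = ['Age', 'Weight']

# Fit once on the original data
scaler = StandardScaler()
scaler.fit(data[cols_to_scale])

# Hypothetical row arriving after fitting; it is scaled with the stored statistics
new_rows = pd.DataFrame({'Name': [7], 'Age': [18], 'Weight': [68]})
new_rows[cols_to_scale] = scaler.transform(new_rows[cols_to_scale])
print(new_rows)
```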
Verticillaster answered 27/8, 2021 at 18:53 Comment(0)
C
5

The easiest way I find is:

from sklearn.preprocessing import StandardScaler
# Select only the float64 columns to scale (integer and string columns are skipped)
numerical = temp.select_dtypes(include='float64').columns
# Scale the selected columns and write them back into the original data frame
temp.loc[:,numerical] = StandardScaler().fit_transform(temp.loc[:,numerical])

Output

         Age  Name    Weight
0 -1.411004     3  1.202703
1  0.623041     4  0.042954
2  0.787964     6 -1.245657
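If the names really are strings, as the question says, selecting by dtype picks out the numeric columns without listing them. A minimal sketch, assuming made-up string names ('Ann', 'Bob', 'Cy') and include='number' instead of 'float64' so that integer columns are caught as well:

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Hypothetical data: Name holds strings here, so it is not a numeric column
temp = pd.DataFrame({'Name': ['Ann', 'Bob', 'Cy'], 'Age': [18, 92, 98], 'Weight': [68, 59, 49]})

# 'number' matches both integer and float columns
numerical = temp.select_dtypes(include='number').columns

# Replace the numeric columns with their scaled versions
temp[numerical] = StandardScaler().fit_transform(temp[numerical])
print(temp)
```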
Casserole answered 10/6, 2021 at 17:8 Comment(0)
S
2

A more pythonic way to do this -

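import pandas as pd
from sklearn.preprocessing import StandardScaler
# apply passes each column as a 1-D Series, but fit_transform expects 2-D input
data[['Age','Weight']] = data[['Age','Weight']].apply(
    lambda x: pd.Series(StandardScaler().fit_transform(x.to_frame()).ravel(),
                        index=x.index))
data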

Output -

         Age  Name    Weight
0 -1.411004     3  1.202703
1  0.623041     4  0.042954
2  0.787964     6 -1.245657
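The same column-wise standardization can also be written in plain pandas, which avoids the lambda altogether. This is a sketch for comparison rather than a replacement for the scaler; the names manual and reference are illustrative.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

data = pd.DataFrame({'Name': [3, 4, 6], 'Age': [18, 92, 98], 'Weight': [68, 59, 49]})
cols = ['Age', 'Weight']

# ddof=0 gives the population standard deviation, which is what StandardScaler uses
manual = (data[cols] - data[cols].mean()) / data[cols].std(ddof=0)

# Compare against sklearn's result for the same columns
reference = StandardScaler().fit_transform(data[cols])
print(np.allclose(manual.values, reference))
```

Unlike a fitted scaler, this version does not store the mean and standard deviation for later reuse.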
Slier answered 17/7, 2016 at 14:7 Comment(1)
"How can I apply the fit and transform functions only on the columns Age and Weight". I was not aware that the OP wanted to do those things. –Slier
F
2

Another option would be to drop the Name column before scaling, then merge it back in:

import pandas as pd
from sklearn.preprocessing import StandardScaler

data = pd.DataFrame({'Name' : [3, 4,6], 'Age' : [18, 92,98], 'Weight' : [68, 59,49]})

# Save the variable you don't want to scale
name_var = data['Name']

# Create the scaler and fit it to the remaining columns
scaler = StandardScaler()
scaler.fit(data.drop('Name', axis = 1))

# Calculate scaled values and store them in a separate object
scaled_values = scaler.transform(data.drop('Name', axis = 1))

# Rebuild the data frame from the scaled values, then add Name back
data = pd.DataFrame(scaled_values, index = data.index, columns = data.drop('Name', axis = 1).columns)
data['Name'] = name_var

print(data)
Fagoting answered 26/6, 2018 at 14:4 Comment(0)
