Apply StandardScaler to parts of a data set [duplicate]

Asked 17/7, 2016 at 11:47 Answered 27/8, 2021 at 18:53

Solved python pandas scikit-learn scale data-science

I want to use sklearn's StandardScaler. Is it possible to apply it to some feature columns but not others?

For instance, say my data is:

data = pd.DataFrame({'Name' : [3, 4,6], 'Age' : [18, 92,98], 'Weight' : [68, 59,49]})

   Age  Name  Weight
0   18     3      68
1   92     4      59
2   98     6      49


col_names = ['Name', 'Age', 'Weight']
features = data[col_names]

I fit and transform the data

scaler = StandardScaler().fit(features.values)
features = scaler.transform(features.values)
scaled_features = pd.DataFrame(features, columns = col_names)

       Name       Age    Weight
0 -1.069045 -1.411004  1.202703
1 -0.267261  0.623041  0.042954
2  1.336306  0.787964 -1.245657

But of course the names are not really integers but strings and I don't want to standardize them. How can I apply the fit and transform methods only on the columns Age and Weight?
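For reference, the numbers in the table above can be reproduced by hand. The following is only a sketch using the standard library, not scikit-learn's own code; the helper name standardize is hypothetical. It assumes what the scikit-learn documentation states: each column is centered on its mean and divided by its population standard deviation (dividing by n, not n - 1).

```python
from math import sqrt

def standardize(values):
    """Return z-scores computed with the population standard deviation."""
    mean = sum(values) / len(values)
    variance = sum((v - mean) ** 2 for v in values) / len(values)  # divide by n
    std = sqrt(variance)
    return [(v - mean) / std for v in values]

# Same Age and Weight values as in the question
age_scaled = standardize([18, 92, 98])
weight_scaled = standardize([68, 59, 49])

print([round(z, 6) for z in age_scaled])
print([round(z, 6) for z in weight_scaled])
```

Because each column is standardized independently of the others, leaving the Name column out does not change the scaled Age and Weight values.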

Cafard answered 17/7, 2016 at 11:47 Comment(1)

I would like to suggest a better solution: the accepted answer does not preserve column names and is therefore poor. Instead, this one-liner should be used: data[['Age', 'Weight']] = StandardScaler().fit_transform(data[['Age', 'Weight']]) – Ferminafermion 12/4, 2022 at 17:0

Introduced in v0.20 is ColumnTransformer which applies transformers to a specified set of columns of an array or pandas DataFrame.

import pandas as pd
data = pd.DataFrame({'Name' : [3, 4,6], 'Age' : [18, 92,98], 'Weight' : [68, 59,49]})

col_names = ['Name', 'Age', 'Weight']
features = data[col_names]

from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler

ct = ColumnTransformer([
        ('somename', StandardScaler(), ['Age', 'Weight'])
    ], remainder='passthrough')

ct.fit_transform(features)

NB: Like Pipeline, it also has a shorthand version, make_column_transformer, which doesn't require naming the transformers.

Output

array([[-1.41100443,  1.20270298,  3.        ],
       [ 0.62304092,  0.04295368,  4.        ],
       [ 0.78796352, -1.24565666,  6.        ]])

Whitney answered 23/1, 2019 at 8:21 Comment(6)

This is now the best answer (doesn't require you to copy a data frame) – Snuggery 5/2, 2019 at 17:18

Nice answer! How could I preserve the column names if I did this with a pandas dataframe? Is there a way without having to rename all columns at the end? – Pharos 23/4, 2020 at 13:37

This is what I was looking for, best answer and faster, although using apply is also an alternative. – Vassar 6/7, 2020 at 9:14

The accepted answer does not preserve column names and is therefore poor. Instead use this one-liner: data[['Age', 'Weight']] = StandardScaler().fit_transform(data[['Age', 'Weight']]) – Ferminafermion 12/4, 2022 at 16:55

Either column names or column order needs to be preserved, otherwise it's very cumbersome to use it. Right now, the passthrough columns are appended to the end and their names are removed, so it's hard to deal with the resulting object. – Erdrich 9/12, 2022 at 12:3

To preserve column names and order see answers to this question – Celebration 14/2, 2023 at 10:18

Update:

Currently the best way to handle this is to use ColumnTransformer as explained here.

First create a copy of your dataframe:

scaled_features = data.copy()

Don't include the Name column in the transformation:

col_names = ['Age', 'Weight']
features = scaled_features[col_names]
scaler = StandardScaler().fit(features.values)
features = scaler.transform(features.values)

Now, don't create a new dataframe but assign the result to those two columns:

scaled_features[col_names] = features
print(scaled_features)


        Age  Name    Weight
0 -1.411004     3  1.202703
1  0.623041     4  0.042954
2  0.787964     6 -1.245657

Bireme answered 17/7, 2016 at 12:3 Comment(5)

It works but I am unable to use the 'inverse_transform' function to obtain the initial values with this method. 'test = scaled_features.iloc[1,:]' 'test_inverse = scaler.inverse_transform(test)' I got the error : ValueError: operands could not be broadcast together with shapes (3,) (2,) (3,) – Cafard 17/7, 2016 at 13:1

scaler.inverse_transform(scaled_features[col_names].values) works for me. – Bireme 17/7, 2016 at 13:6

I was trying to test the inverse_transform function with the first row. Yes, it works for me too, but I'm losing the column names. I could insert them if I (re)convert the whole dataframe. But what if I want to inverse_transform only the first line? – Cafard 17/7, 2016 at 13:22

Excuse me if I haven't been clear, but when I mention the column name I mean the column containing the names (the 2nd column of the dataframe, the one that I don't want to scale), not the names of the columns – Cafard 17/7, 2016 at 13:41

Yes (not necessarily the first row, but a new line with the same structure) – Cafard 17/7, 2016 at 13:49
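One way to handle the single-row case discussed in these comments, offered as a sketch rather than the answerer's own method: the broadcast error arises because the scaler was fitted on two columns but receives a three-value row. Selecting only the scaled columns and keeping the row two-dimensional avoids it. The names row and restored are illustrative.

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

data = pd.DataFrame({'Name': [3, 4, 6], 'Age': [18, 92, 98], 'Weight': [68, 59, 49]})
col_names = ['Age', 'Weight']

# Fit on the two numeric columns only, as in the answer above
scaler = StandardScaler().fit(data[col_names].values)
scaled_features = data.copy()
scaled_features[col_names] = scaler.transform(data[col_names].values)

# Double brackets keep a one-row DataFrame instead of a Series
row = scaled_features.iloc[[1]]

# Invert only the columns the scaler was fitted on; Name is left untouched
restored = row.copy()
restored[col_names] = scaler.inverse_transform(row[col_names].values)
print(restored)
```

The same pattern applies to a new row with the same structure, as long as only the Age and Weight values are passed to the scaler.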

Late to the party, but here's my preferred solution:

import pandas as pd
from sklearn.preprocessing import StandardScaler

#load data
data = pd.DataFrame({'Name' : [3, 4,6], 'Age' : [18, 92,98], 'Weight' : [68, 59,49]})

#list for cols to scale
cols_to_scale = ['Age','Weight']

#create and fit scaler
scaler = StandardScaler()
scaler.fit(data[cols_to_scale])

#scale selected data
data[cols_to_scale] = scaler.transform(data[cols_to_scale])

Verticillaster answered 27/8, 2021 at 18:53 Comment(0)

The easiest way I find is:

from sklearn.preprocessing import StandardScaler
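# Select only the float64 columns to scale; `temp` is the DataFrame being processed.
# Integer columns are not matched by include='float64'.
numerical = temp.select_dtypes(include='float64').columns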
# This will transform the selected columns and merge to the original data frame
temp.loc[:,numerical] = StandardScaler().fit_transform(temp.loc[:,numerical])

Output

         Age  Name    Weight
0 -1.411004     3  1.202703
1  0.623041     4  0.042954
2  0.787964     6 -1.245657
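A sketch of the same dtype-based selection applied to data shaped like the question's, under two assumptions that are not in the answer: the Name column really holds strings (the values 'Ann', 'Bob', 'Cy' are made up), and include='number' is used so that integer columns such as Age and Weight are selected as well.

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Hypothetical variant of the question's data in which Name is a string column
people = pd.DataFrame({'Name': ['Ann', 'Bob', 'Cy'],
                       'Age': [18, 92, 98],
                       'Weight': [68, 59, 49]})

# include='number' matches integer and float columns alike
numerical = people.select_dtypes(include='number').columns

# Plain column assignment replaces the integer columns with the scaled floats
people[numerical] = StandardScaler().fit_transform(people[numerical])
print(people)
```

If numeric identifier columns are present, they would be selected too and need to be excluded by name.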

Casserole answered 10/6, 2021 at 17:8 Comment(0)

A more pythonic way to do this -

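import pandas as pd
from sklearn.preprocessing import StandardScaler
# fit_transform expects 2-D input, so each column (a Series) is converted
# to a one-column frame and the scaled result is flattened back to a Series
data[['Age','Weight']] = data[['Age','Weight']].apply(
    lambda x: pd.Series(StandardScaler().fit_transform(x.to_frame())[:, 0],
                        index=x.index))
data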

Output -

         Age  Name    Weight
0 -1.411004     3  1.202703
1  0.623041     4  0.042954
2  0.787964     6 -1.245657

Slier answered 17/7, 2016 at 14:7 Comment(1)

"How can I apply the fit and transform functions only on the columns Age and Weight". I was not aware that the OP wanted to do those things. – Slier 17/7, 2016 at 14:37

Another option would be to drop the Name column before scaling, then merge it back:

import pandas as pd
from sklearn.preprocessing import StandardScaler

data = pd.DataFrame({'Name' : [3, 4,6], 'Age' : [18, 92,98], 'Weight' : [68, 59,49]})

# Save the variable you don't want to scale
name_var = data['Name']

# Create the scaler and fit it to your data
scaler = StandardScaler()
scaler.fit(data.drop('Name', axis = 1))

# Calculate scaled values and store them in a separate object
scaled_values = scaler.transform(data.drop('Name', axis = 1))

# Rebuild the dataframe from the scaled values, then add Name back
data = pd.DataFrame(scaled_values, index = data.index, columns = data.drop('Name', axis = 1).columns)
data['Name'] = name_var

print(data)

Fagoting answered 26/6, 2018 at 14:4 Comment(0)
