How to use ColumnTransformer() to return a dataframe?

Asked 31/1, 2022 at 21:26 Answered 19/5, 2023 at 12:47

Solved python-3.x dataframe scikit-learn encoder

I have a dataframe like this:

department      review  projects salary satisfaction bonus  avg_hrs_month   left
0   operations  0.577569    3   low         0.626759    0   180.866070      0
1   operations  0.751900    3   medium      0.443679    0   182.708149      0
2   support     0.722548    3   medium      0.446823    0   184.416084      0
3   logistics   0.675158    4   high        0.440139    0   188.707545      0
4   sales       0.676203    3   high        0.577607    1   179.821083      0

I want to try ColumnTransformer() and return a transformed dataframe.

ord_features = ["salary"]
ordinal_transformer = OrdinalEncoder()


cat_features = ["department"]
categorical_transformer = OneHotEncoder(handle_unknown="ignore")

ct = ColumnTransformer(
    transformers=[
        ("ord", ordinal_transformer, ord_features),
        ("cat", categorical_transformer, cat_features ),
           ]
)

df_new = ct.fit_transform(df)
df_new

This gives me a sparse matrix of type '<class 'numpy.float64'>'.

If I use pd.DataFrame(ct.fit_transform(df)), I get a single column:

                            0
0   (0, 0)\t1.0\n (0, 7)\t1.0
1   (0, 0)\t2.0\n (0, 7)\t1.0
2   (0, 0)\t2.0\n (0, 10)\t1.0
3   (0, 5)\t1.0
4   (0, 9)\t1.0

However, I was expecting to see a transformed dataframe like this:

    review  projects salary satisfaction bonus  avg_hrs_month   operations support ...
0   0.577569    3    1      0.626759     0      180.866070      1           0
1   0.751900    3    2      0.443679     0      182.708149      1           0  
2   0.722548    3    2      0.446823     0      184.416084      0           1
3   0.675158    4    3      0.440139     0      188.707545      0           0
4   0.676203    3    3      0.577607     1      179.821083      0           0

Is it possible with ColumnTransformer()?

Pepi answered 31/1, 2022 at 21:26 Comment(1)

You can call .toarray() on the output of .fit_transform(), as follows: pd.DataFrame(ct.fit_transform(df).toarray()). For column names, instead, you'll have to stick to something custom because OrdinalEncoder does not provide a .get_feature_names_out() method, unlike OneHotEncoder. Finally, for the order of the transformed columns I would suggest having a look at #68874992. – Hildegardehildesheim 31/1, 2022 at 22:9

As quickly sketched in the comment, there are a couple of considerations to make about your example:

The .fit_transform() method generally returns either a sparse matrix or a numpy array. Returning a sparse matrix saves memory: think of one-hot encoding a categorical attribute with many categories. You end up with a matrix with many columns and a single non-zero entry per row; a sparse matrix stores only the location of the non-zero elements. In these situations you can call .toarray() on the output of .fit_transform() to get a numpy array back and pass it to the pd.DataFrame constructor.
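To make the memory argument concrete, here is a minimal sketch on a made-up one-column frame (the data and variable names are only illustrative). It assumes the default behaviour of OneHotEncoder, which returns a SciPy sparse matrix, and shows that only one value per row is actually stored before the result is densified for pandas.

```python
import pandas as pd
from scipy import sparse
from sklearn.preprocessing import OneHotEncoder

# Toy column with four distinct categories (made-up data).
toy = pd.DataFrame({"department": ["operations", "support", "logistics", "sales"]})

# By default OneHotEncoder returns a SciPy sparse matrix.
encoder = OneHotEncoder(handle_unknown="ignore")
encoded = encoder.fit_transform(toy)

# Only the non-zero entries are stored: one per row.
is_sparse = sparse.issparse(encoded)
stored_values = encoded.nnz

# Densify before handing the result to the pd.DataFrame constructor.
dense_df = pd.DataFrame(encoded.toarray(), columns=encoder.categories_[0])
print(dense_df)
```

Passing the sparse matrix straight to pd.DataFrame is what produces the odd single-column frame shown in the question.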

Actually, on a five-row dataset similar to the one you provided
```
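import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder

df = pd.DataFrame({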
    'department': ['operations', 'operations', 'support', 'logistics', 'sales'],
    'review': [0.577569, 0.751900, 0.722548, 0.675158, 0.676203],
    'projects': [3, 3, 3, 4, 3],
    'salary': ['low', 'medium', 'medium', 'low', 'high'],
    'satisfaction': [0.626759, 0.751900, 0.722548, 0.675158, 0.676203],
    'bonus': [0, 0, 0, 0, 1],
    'avg_hrs_month': [180.866070, 182.708149, 184.416084, 188.707545, 179.821083],
    'left': [0, 0, 1, 0, 0]
})

ord_features = ["salary"]
ordinal_transformer = OrdinalEncoder()

cat_features = ["department"]
categorical_transformer = OneHotEncoder(handle_unknown="ignore")

ct = ColumnTransformer(transformers=[
    ("ord", ordinal_transformer, ord_features),
    ("cat", categorical_transformer, cat_features),
])
```
I can't reproduce your issue: on this small dataset I directly obtain a numpy array, most likely because ColumnTransformer only stacks its outputs as a sparse matrix when their overall density is below its sparse_threshold parameter (0.3 by default). Whenever you do get a sparse matrix, pd.DataFrame(ct.fit_transform(df).toarray()) should work for your case.
Compared with your expected output, the result only contains the transformed (ordinally encoded) salary column as the first column and the transformed (one-hot-encoded) department column from the second to the last column. That's because, as you can see in the docs, the remainder parameter is set to 'drop' by default, which means that all columns not subject to a transformation are dropped. To avoid this, set it to 'passthrough'; this transforms the columns you need and keeps the others untouched.
```
ct = ColumnTransformer(transformers=[
    ("ord", ordinal_transformer, ord_features),
    ("cat", categorical_transformer, cat_features )],
    remainder='passthrough'
)
```
With this setting, the output of pd.DataFrame(ct.fit_transform(df).toarray()) also keeps the untouched columns. The column order, however, is still not the one you would expect after the transformation. Long story short, that's because in a ColumnTransformer

The order of the columns in the transformed feature matrix follows the order of how the columns are specified in the transformers list. Columns of the original feature matrix that are not specified are dropped from the resulting transformed feature matrix, unless specified in the passthrough keyword. Those columns specified with passthrough are added at the right to the output of the transformers.

I would suggest reading Preserve column order after applying sklearn.compose.ColumnTransformer on this point.
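The ordering rule quoted above can be checked on a small made-up frame. This is only a sketch: the column names are written by hand, and the explicit categories list passed to OrdinalEncoder is my own addition so that low < medium < high (by default the categories are sorted alphabetically).

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OrdinalEncoder

# Made-up frame where the encoded column sits in the middle.
df = pd.DataFrame({
    "review": [0.58, 0.75, 0.72],
    "salary": ["low", "medium", "high"],
    "projects": [3, 3, 4],
})

ct = ColumnTransformer(
    transformers=[
        # Explicit category order (an assumption about what "salary" should mean).
        ("ord", OrdinalEncoder(categories=[["low", "medium", "high"]]), ["salary"]),
    ],
    remainder="passthrough",
)
out = ct.fit_transform(df)

# Transformed columns come first, passthrough columns are appended on the right.
out_df = pd.DataFrame(out, columns=["salary", "review", "projects"], index=df.index)

# Reindexing with the original column list restores the original order.
restored = out_df[df.columns]
print(restored)
```

This simple reindexing only works when every transformer maps one input column to one output column; with a OneHotEncoder in the mix the new column names have to be handled separately.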

Finally, as for column names, you should probably apply a custom solution and pass the names you want directly to the columns parameter of the pd.DataFrame constructor. Indeed, at the time of writing, OrdinalEncoder (unlike OneHotEncoder) does not provide a .get_feature_names_out() method, which would otherwise make it easy to pass columns=ct.get_feature_names_out() to the pd.DataFrame constructor. See ColumnTransformer & Pipeline with OHE - Is the OHE encoded field retained or removed after ct is performed? for an example of its usage.

Update 10/2022 - sklearn version 1.2.dev0

With sklearn version 1.2.0 it will be much easier to get a DataFrame back when transforming with a ColumnTransformer instance. That version has not been released yet, but you can test the following in dev (version 1.2.dev0) by installing the nightly builds as follows:

pip install --pre --extra-index https://pypi.anaconda.org/scipy-wheels-nightly/simple scikit-learn -U

The ColumnTransformer (and other transformers as well) now exposes a .set_output() method, which lets you configure a transformer to output pandas DataFrames by passing the parameter transform='pandas' to it.

Therefore, the example becomes:

import pandas as pd
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder
from sklearn.compose import ColumnTransformer

df = pd.DataFrame({
    'department': ['operations', 'operations', 'support', 'logistics', 'sales'],
    'review': [0.577569, 0.751900, 0.722548, 0.675158, 0.676203],
    'projects': [3, 3, 3, 4, 3],
    'salary': ['low', 'medium', 'medium', 'low', 'high'],
    'satisfaction': [0.626759, 0.751900, 0.722548, 0.675158, 0.676203],
    'bonus': [0, 0, 0, 0, 1],
    'avg_hrs_month': [180.866070, 182.708149, 184.416084, 188.707545, 179.821083],
    'left': [0, 0, 1, 0, 0]
})

ord_features = ["salary"]
ordinal_transformer = OrdinalEncoder()

cat_features = ["department"]
categorical_transformer = OneHotEncoder(sparse_output=False, handle_unknown="ignore")

ct = ColumnTransformer(transformers=[
    ("ord", ordinal_transformer, ord_features),
    ("cat", categorical_transformer, cat_features )],
    remainder='passthrough'
)

ct.set_output(transform='pandas')  # transform must be passed as a keyword argument
df_pandas = ct.fit_transform(df)
df_pandas

The output also becomes much easier to read as it has proper column names (indeed, at each step, the transformers that make up the ColumnTransformer have the attribute feature_names_in_, so you no longer lose column names while transforming the input).

Last note. Observe that the example now requires parameter sparse_output=False to be passed to the OneHotEncoder instance in order to work.

Hildegardehildesheim answered 1/2, 2022 at 0:23 Comment(0)

This answer skips the workaround and directly provides a solution for scikit-learn version 1.2+

From sklearn version 1.2 on, transformers can return a pandas DataFrame directly without further handling. It is done with set_output, which can be configured per estimator by calling the set_output method or globally by setting set_config(transform_output="pandas"). See Release Highlights for scikit-learn 1.2 - Pandas output with set_output API

In your case the solution would be:

ord_features = ["salary"]
ordinal_transformer = OrdinalEncoder()


cat_features = ["department"]
# sparse_output=False is needed: pandas output does not support sparse data
categorical_transformer = OneHotEncoder(sparse_output=False, handle_unknown="ignore")

ct = ColumnTransformer(
    transformers=[
        ("ord", ordinal_transformer, ord_features),
        ("cat", categorical_transformer, cat_features ),
           ]
)

# Add the following line to your code
ct.set_output(transform="pandas")

df_new = ct.fit_transform(df)
df_new

Tickler answered 8/12, 2022 at 16:8 Comment(0)

When using FunctionTransformer, it's important to add feature_names_out='one-to-one' to ensure that the names and positions of the dataframe columns are returned by FunctionTransformer.

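import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import FunctionTransformer

log_transformer = Pipeline(steps=[
     ('imputer', SimpleImputer(strategy='median')),
     ('log_transformer', FunctionTransformer(np.log1p, validate=True, feature_names_out='one-to-one'))
])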

Without feature_names_out, the dataframe columns are plain numbers; with feature_names_out, the original column names are kept.

Sigmund answered 10/3, 2023 at 12:40 Comment(0)

You can construct a dataframe from the output of column transformers as follows:

#a pre-proc pipeline of several transformers acting sequentially 
df_std = preproc.fit_transform( data )    ##****  np

You can convert it to a pandas DataFrame:

#convert it to a DF
pd.DataFrame( df_std, columns = preproc.get_feature_names_out())

This is the complete example. You can copy and paste it.

import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder, MinMaxScaler
from sklearn.pipeline import Pipeline

data = pd.DataFrame( {'a':[1,2,3,4,np.nan], 
                      'b':[1111.1,22222.2,33333.3,4433.3,5555.5], 
                      'c':['s1','s2','s3','s4','s5'] })
                
print(data.head(5))  # display() is only available inside IPython/Jupyter

#one transformer  
imputer = ColumnTransformer( [ ('imp',SimpleImputer(),['a'])], 
remainder='passthrough',verbose_feature_names_out=False)

#another
scaler  = ColumnTransformer( [ ('scaler',MinMaxScaler(),[0,1])], 
remainder='passthrough',verbose_feature_names_out=False)

#another 
encoder  = ColumnTransformer( [ ('encoder',OneHotEncoder(),[2])], 
remainder='passthrough',verbose_feature_names_out=False)

preproc = Pipeline( steps = [('imp',imputer) , 
                             ('std',scaler) , 
                             ('enc',encoder) 
                            ])

df_std = preproc.fit_transform( data )    ##****  np 

pd.DataFrame( df_std, columns = preproc.get_feature_names_out())

Zootechnics answered 19/5, 2023 at 12:47 Comment(0)
