Preserve column order after applying sklearn.compose.ColumnTransformer

I'm using Pipeline and ColumnTransformer modules from sklearn library to perform feature engineering on my dataset.

The dataset initially looks like this:

date date_block_num shop_id item_id item_price
02.01.2013 0 59 22154 999.00
03.01.2013 0 25 2552 899.00
05.01.2013 0 25 2552 899.00
06.01.2013 0 25 2554 1709.05
15.01.2013 0 25 2555 1099.00
$> data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2935849 entries, 0 to 2935848
Data columns (total 6 columns):
 #   Column          Dtype  
---  ------          -----  
 0   date            object 
 1   date_block_num  object 
 2   shop_id         object 
 3   item_id         object 
 4   item_price      float64
 5   item_cnt_day    float64
dtypes: float64(2), object(4)
memory usage: 134.4+ MB

Then I have the following transformations:

num_column_transformer = ColumnTransformer(
    transformers=[
        ("std_scaler", StandardScaler(), make_column_selector(dtype_include=np.number)),
    ],
    remainder="passthrough"
)

num_pipeline = Pipeline(
    steps=[
        ("percent_item_cnt_day_per_shop", PercentOverTotalAttributeWholeAdder(
            attribute_percent_name="shop_id",
            attribute_total_name="item_cnt_day",
            new_attribute_name="%_item_cnt_day_per_shop")
        ),
        ("percent_item_cnt_day_per_item", PercentOverTotalAttributeWholeAdder(
            attribute_percent_name="item_id",
            attribute_total_name="item_cnt_day",
            new_attribute_name="%_item_cnt_day_per_item")
        ),
        ("percent_sales_per_shop", SalesPerAttributeOverTotalSalesAdder(
            attribute_percent_name="shop_id",
            new_attribute_name="%_sales_per_shop")
        ),
        ("percent_sales_per_item", SalesPerAttributeOverTotalSalesAdder(
            attribute_percent_name="item_id",
            new_attribute_name="%_sales_per_item")
        ),
        ("num_column_transformer", num_column_transformer),
    ]
)

The first four transformers each create a new numeric variable, and the last one applies StandardScaler to all the numerical columns of the dataset.

After executing it, I get the following data:

0 1 2 3 4 5 6 7 8
-0.092652 -0.765612 -0.173122 -0.756606 -0.379775 02.01.2013 0 59 22154
-0.092652 1.557684 -0.175922 1.563224 -0.394319 03.01.2013 0 25 2552
-0.856351 1.557684 -0.175922 1.563224 -0.394319 05.01.2013 0 25 2552
-0.092652 1.557684 -0.17613 1.563224 -0.396646 06.01.2013 0 25 2554
-0.092652 1.557684 -0.173278 1.563224 -0.380647 15.01.2013 0 25 2555

I'd like to have the following output:

date date_block_num shop_id item_id item_price %_item_cnt_day_per_shop %_item_cnt_day_per_item %_sales_per_shop %_sales_per_item
02.01.2013 0 59 22154 -0.092652 -0.765612 -0.173122 -0.756606 -0.379775
03.01.2013 0 25 2552 -0.092652 1.557684 -0.175922 1.563224 -0.394319
05.01.2013 0 25 2552 -0.856351 1.557684 -0.175922 1.563224 -0.394319
06.01.2013 0 25 2554 -0.092652 1.557684 -0.17613 1.563224 -0.396646
15.01.2013 0 25 2555 -0.092652 1.557684 -0.173278 1.563224 -0.380647

As you can see, columns 5, 6, 7, and 8 of the output correspond to the first four columns of the original dataset, and I can't tell where the item_price feature ends up in the output table.

  1. How can I preserve the column order and names? I want to do feature engineering on categorical variables next, and my transformers rely on the feature column names.
  2. Am I using the Scikit-Learn API correctly?
Dumb answered 21/8, 2021 at 15:42 Comment(0)

There's one point to be aware of when dealing with ColumnTransformer, which is noted in the docs as follows:

The order of the columns in the transformed feature matrix follows the order of how the columns are specified in the transformers list.

That's the reason why your ColumnTransformer instance messes things up. Indeed, consider this simplified example which resembles your setting:

import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer, make_column_selector
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline

df = pd.DataFrame({
               'date': ['02.01.2013', '03.01.2013', '05.01.2013', '06.01.2013', '15.01.2013'], 
               'date_block_num': ['0', '0', '0', '0', '0'], 
               'shop_id': ['59', '25', '25', '25', '25'],
               'item_id': ['22514', '2252', '2252', '2254', '2255'], 
               'item_price': [999.00, 899.00, 899.00, 1709.05, 1099.00]})

ct = ColumnTransformer([
    ('std_scaler', StandardScaler(), make_column_selector(dtype_include=np.number))], 
    remainder='passthrough')

pd.DataFrame(ct.fit_transform(df), columns=ct.get_feature_names_out())

(screenshot: the transformed dataframe, with the scaled item_price column first, followed by the passed-through string columns)

As you might notice, the first column in the transformed dataframe turns out to be the numeric one, i.e. the one which undergoes the scaling (and the first in the transformers list).

Conversely, here's how you can bypass the issue: list a passthrough of the string variables before the scaling of the numeric ones, so that the columns come out in your desired order:

ct = ColumnTransformer([
    ('pass', 'passthrough', make_column_selector(dtype_include=object)),
    ('std_scaler', StandardScaler(), make_column_selector(dtype_include=np.number))
])

pd.DataFrame(ct.fit_transform(df), columns=ct.get_feature_names_out())

(screenshot: the transformed dataframe, with the string columns first and the scaled item_price last, i.e. in the original column order)

To complete the picture, here is an attempt to reproduce your Pipeline (though the custom transformer is for sure slightly different from yours):

from sklearn.base import BaseEstimator, TransformerMixin

class PercentOverTotalAttributeWholeAdder(BaseEstimator, TransformerMixin):

    def __init__(self, attribute_percent_name='shop_id', new_attribute_name='%_item_cnt_day_per_shop'):
        self.attribute_percent_name = attribute_percent_name
        self.new_attribute_name = new_attribute_name

    def fit(self, X, y=None):
        return self

    def transform(self, X, y=None):
        # work on a copy and use X, not a global dataframe
        X = X.copy()
        X[self.new_attribute_name] = X.groupby(by=self.attribute_percent_name)[self.attribute_percent_name].transform('count') / X.shape[0]
        return X

ct_pipe = ColumnTransformer([
    ('pass', 'passthrough', make_column_selector(dtype_include=object)),
    ('std_scaler', StandardScaler(), make_column_selector(dtype_include=np.number))
    ], verbose_feature_names_out=False)

pipe = Pipeline([
    ('percent_item_cnt_day_per_shop', PercentOverTotalAttributeWholeAdder(
        attribute_percent_name='shop_id',
        new_attribute_name='%_item_cnt_day_per_shop')
    ),
    ('percent_item_cnt_day_per_item', PercentOverTotalAttributeWholeAdder(
        attribute_percent_name='item_id',
        new_attribute_name='%_item_cnt_day_per_item')
    ),
    ('column_trans', ct_pipe),
])

pd.DataFrame(pipe.fit_transform(df), columns=pipe[-1].get_feature_names_out())

(screenshot: the transformed dataframe, with date, date_block_num, shop_id and item_id first, followed by the scaled numeric columns, without transformer-name prefixes)

As a final remark, observe that the verbose_feature_names_out=False parameter ensures that the names of the columns of the transformed dataframe do not show prefixes which refer to the different transformers in ColumnTransformer.

Weig answered 30/12, 2021 at 0:50 Comment(0)

Answer using scikit-learn 1.2.1

Since scikit-learn 1.2, it's possible to set the output of the ColumnTransformer to a pandas dataframe, avoiding that conversion as a second step. Besides this, in the answer proposed by @amiola, the ColumnTransformer uses a passthrough phase to preserve the order of string-type columns with respect to numeric ones, but this only works if all the string-type columns come before the numeric ones. To show this, I use the same example with the shop_id column converted to numeric:

import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer, make_column_selector
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline

df = pd.DataFrame({
               'date': ['02.01.2013', '03.01.2013', '05.01.2013', '06.01.2013', '15.01.2013'], 
               'date_block_num': ['0', '0', '0', '0', '0'], 
               'shop_id': [59, 25, 25, 25, 25],
               'item_id': ['22514', '2252', '2252', '2254', '2255'], 
               'item_price': [999.00, 899.00, 899.00, 1709.05, 1099.00]})
ct = ColumnTransformer([
         ('pass', 'passthrough', make_column_selector(dtype_include=object)),
         ('std_scaler', StandardScaler(), make_column_selector(dtype_include=np.number))
                      ]).set_output(transform='pandas')
out_df = ct.fit_transform(df)
out_df
pass__date pass__date_block_num pass__item_id std_scaler__shop_id std_scaler__item_price
0 02.01.2013 0 22514 2.0 -0.402369
1 03.01.2013 0 2252 -0.5 -0.732153
2 05.01.2013 0 2252 -0.5 -0.732153
3 06.01.2013 0 2254 -0.5 1.939261
4 15.01.2013 0 2255 -0.5 -0.072585

As you can see, the shop_id column is moved to the end, for the same reason explained in amiola's answer (i.e. columns are reordered to follow the order of the transformers in the ColumnTransformer). To overcome this, you can reorder the dataframe columns after the transformation, setting verbose_feature_names_out to False so that the original column names are preserved (beware that those names must be unique, see the docs). There's also no need to create a specific passthrough step.

ct = ColumnTransformer([
    ('std_scaler', StandardScaler(), make_column_selector(dtype_include=np.number))],
     remainder='passthrough',
     verbose_feature_names_out=False).set_output(transform='pandas')

out_df = ct.fit_transform(df)
out_df = out_df[df.columns]
out_df
date date_block_num shop_id item_id item_price
0 02.01.2013 0 2.0 22514 -0.402369
1 03.01.2013 0 -0.5 2252 -0.732153
2 05.01.2013 0 -0.5 2252 -0.732153
3 06.01.2013 0 -0.5 2254 1.939261
4 15.01.2013 0 -0.5 2255 -0.072585
Machinery answered 14/2, 2023 at 10:10 Comment(1)
This is a great example of why we need new answers to old questions. – Quern

© 2022 - 2024 — McMap. All rights reserved.