Preserve column order after applying sklearn.compose.ColumnTransformer

I'm using Pipeline and ColumnTransformer modules from sklearn library to perform feature engineering on my dataset.

The dataset initially looks like this:

date date_block_num shop_id item_id item_price
02.01.2013 0 59 22154 999.00
03.01.2013 0 25 2552 899.00
05.01.2013 0 25 2552 899.00
06.01.2013 0 25 2554 1709.05
15.01.2013 0 25 2555 1099.00
$> data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2935849 entries, 0 to 2935848
Data columns (total 6 columns):
 #   Column          Dtype  
---  ------          -----  
 0   date            object 
 1   date_block_num  object 
 2   shop_id         object 
 3   item_id         object 
 4   item_price      float64
 5   item_cnt_day    float64
dtypes: float64(2), object(4)
memory usage: 134.4+ MB

Then I have the following transformations:

num_column_transformer = ColumnTransformer(
    transformers=[
        ("std_scaler", StandardScaler(), make_column_selector(dtype_include=np.number)),
    ],
    remainder="passthrough"
)

num_pipeline = Pipeline(
    steps=[
        ("percent_item_cnt_day_per_shop", PercentOverTotalAttributeWholeAdder(
            attribute_percent_name="shop_id",
            attribute_total_name="item_cnt_day",
            new_attribute_name="%_item_cnt_day_per_shop")
        ),
        ("percent_item_cnt_day_per_item", PercentOverTotalAttributeWholeAdder(
            attribute_percent_name="item_id",
            attribute_total_name="item_cnt_day",
            new_attribute_name="%_item_cnt_day_per_item")
        ),
        ("percent_sales_per_shop", SalesPerAttributeOverTotalSalesAdder(
            attribute_percent_name="shop_id",
            new_attribute_name="%_sales_per_shop")
        ),
        ("percent_sales_per_item", SalesPerAttributeOverTotalSalesAdder(
            attribute_percent_name="item_id",
            new_attribute_name="%_sales_per_item")
        ),
        ("num_column_transformer", num_column_transformer),
    ]
)

The first four transformers each create a new numeric variable, and the last one applies StandardScaler to all the numerical columns of the dataset.

After executing it, I get the following data:

0 1 2 3 4 5 6 7 8
-0.092652 -0.765612 -0.173122 -0.756606 -0.379775 02.01.2013 0 59 22154
-0.092652 1.557684 -0.175922 1.563224 -0.394319 03.01.2013 0 25 2552
-0.856351 1.557684 -0.175922 1.563224 -0.394319 05.01.2013 0 25 2552
-0.092652 1.557684 -0.17613 1.563224 -0.396646 06.01.2013 0 25 2554
-0.092652 1.557684 -0.173278 1.563224 -0.380647 15.01.2013 0 25 2555

I'd like to have the following output:

date date_block_num shop_id item_id item_price %_item_cnt_day_per_shop %_item_cnt_day_per_item %_sales_per_shop %_sales_per_item
02.01.2013 0 59 22154 -0.092652 -0.765612 -0.173122 -0.756606 -0.379775
03.01.2013 0 25 2552 -0.092652 1.557684 -0.175922 1.563224 -0.394319
05.01.2013 0 25 2552 -0.856351 1.557684 -0.175922 1.563224 -0.394319
06.01.2013 0 25 2554 -0.092652 1.557684 -0.17613 1.563224 -0.396646
15.01.2013 0 25 2555 -0.092652 1.557684 -0.173278 1.563224 -0.380647

As you can see, columns 5, 6, 7, and 8 of the output correspond to the first four columns of the original dataset, and I can't tell where the item_price feature ends up in the output table.

  1. How can I preserve the column order and names? I want to do feature engineering on categorical variables next, and my transformers rely on the feature column names.
  2. Am I using the Scikit-Learn API correctly?
Dumb answered 21/8, 2021 at 15:42 Comment(0)

There's one point to be aware of when dealing with ColumnTransformer, which is noted in the docs as follows:

The order of the columns in the transformed feature matrix follows the order of how the columns are specified in the transformers list.

That's the reason why your ColumnTransformer instance messes things up. Indeed, consider this simplified example which resembles your setting:

import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer, make_column_selector
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline

df = pd.DataFrame({
               'date': ['02.01.2013', '03.01.2013', '05.01.2013', '06.01.2013', '15.01.2013'], 
               'date_block_num': ['0', '0', '0', '0', '0'], 
               'shop_id': ['59', '25', '25', '25', '25'],
               'item_id': ['22514', '2252', '2252', '2254', '2255'], 
               'item_price': [999.00, 899.00, 899.00, 1709.05, 1099.00]})

ct = ColumnTransformer([
    ('std_scaler', StandardScaler(), make_column_selector(dtype_include=np.number))], 
    remainder='passthrough')

pd.DataFrame(ct.fit_transform(df), columns=ct.get_feature_names_out())

(screenshot: the transformed dataframe, with the scaled item_price column first, followed by the passed-through string columns)

As you might notice, the first column in the transformed dataframe turns out to be the numeric one, i.e. the one which undergoes the scaling (and the first in the transformers list).

Conversely, here's how you can bypass the issue: list a passthrough of the string variables before the scaling of the numeric ones, so that the columns come out in your desired order:

ct = ColumnTransformer([
    ('pass', 'passthrough', make_column_selector(dtype_include=object)),
    ('std_scaler', StandardScaler(), make_column_selector(dtype_include=np.number))
])

pd.DataFrame(ct.fit_transform(df), columns=ct.get_feature_names_out())

(screenshot: the transformed dataframe, with the string columns first and the scaled item_price last, i.e. in the original column order)

To complete the picture, here is an attempt to reproduce your Pipeline (though the custom transformer is for sure slightly different from yours):

from sklearn.base import BaseEstimator, TransformerMixin

class PercentOverTotalAttributeWholeAdder(BaseEstimator, TransformerMixin):

    def __init__(self, attribute_percent_name='shop_id', new_attribute_name='%_item_cnt_day_per_shop'):
        self.attribute_percent_name = attribute_percent_name
        self.new_attribute_name = new_attribute_name

    def fit(self, X, y=None):
        return self

    def transform(self, X, y=None):
        # work on a copy and use X, not a global dataframe
        X = X.copy()
        X[self.new_attribute_name] = X.groupby(by=self.attribute_percent_name)[self.attribute_percent_name].transform('count') / X.shape[0]
        return X

ct_pipe = ColumnTransformer([
    ('pass', 'passthrough', make_column_selector(dtype_include=object)),
    ('std_scaler', StandardScaler(), make_column_selector(dtype_include=np.number))
    ], verbose_feature_names_out=False)

pipe = Pipeline([
    ('percent_item_cnt_day_per_shop', PercentOverTotalAttributeWholeAdder(
        attribute_percent_name='shop_id',
        new_attribute_name='%_item_cnt_day_per_shop')
    ),
    ('percent_item_cnt_day_per_item', PercentOverTotalAttributeWholeAdder(
        attribute_percent_name='item_id',
        new_attribute_name='%_item_cnt_day_per_item')
    ),
    ('column_trans', ct_pipe),
])

pd.DataFrame(pipe.fit_transform(df), columns=pipe[-1].get_feature_names_out())

(screenshot: the transformed dataframe, with date, date_block_num, shop_id and item_id first, followed by the scaled numeric columns, without transformer-name prefixes)

As a final remark, observe that the verbose_feature_names_out=False parameter ensures that the names of the columns of the transformed dataframe do not show prefixes which refer to the different transformers in ColumnTransformer.

Weig answered 30/12, 2021 at 0:50 Comment(0)

Answer using scikit-learn 1.2.1

Since scikit-learn 1.2, it's possible to set the output of the ColumnTransformer to a pandas dataframe, avoiding that conversion as a second step. Besides this, in the answer proposed by @amiola, the ColumnTransformer uses a passthrough phase to preserve the order of string-type columns with respect to numeric ones, but this only works if all the string-type columns come before the numeric ones. To show this, I use the same example with the shop_id column converted to numeric:

import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer, make_column_selector
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline

df = pd.DataFrame({
               'date': ['02.01.2013', '03.01.2013', '05.01.2013', '06.01.2013', '15.01.2013'], 
               'date_block_num': ['0', '0', '0', '0', '0'], 
               'shop_id': [59, 25, 25, 25, 25],
               'item_id': ['22514', '2252', '2252', '2254', '2255'], 
               'item_price': [999.00, 899.00, 899.00, 1709.05, 1099.00]})
ct = ColumnTransformer([
         ('pass', 'passthrough', make_column_selector(dtype_include=object)),
         ('std_scaler', StandardScaler(), make_column_selector(dtype_include=np.number))
                      ]).set_output(transform='pandas')
out_df = ct.fit_transform(df)
out_df
pass__date pass__date_block_num pass__item_id std_scaler__shop_id std_scaler__item_price
0 02.01.2013 0 22514 2.0 -0.402369
1 03.01.2013 0 2252 -0.5 -0.732153
2 05.01.2013 0 2252 -0.5 -0.732153
3 06.01.2013 0 2254 -0.5 1.939261
4 15.01.2013 0 2255 -0.5 -0.072585

As you can see, the shop_id column is moved to the end, for the same reason explained in amiola's answer (i.e. columns are reordered to follow the order of the transformers in the ColumnTransformer). To overcome this, you can reorder the dataframe columns after the transformation, setting verbose_feature_names_out to False so that the original column names are preserved (beware that those names must be unique, see the docs). There's also no need to create a specific passthrough step.

ct = ColumnTransformer([
    ('std_scaler', StandardScaler(), make_column_selector(dtype_include=np.number))],
     remainder='passthrough',
     verbose_feature_names_out=False).set_output(transform='pandas')

out_df = ct.fit_transform(df)
out_df = out_df[df.columns]
out_df
date date_block_num shop_id item_id item_price
0 02.01.2013 0 2.0 22514 -0.402369
1 03.01.2013 0 -0.5 2252 -0.732153
2 05.01.2013 0 -0.5 2252 -0.732153
3 06.01.2013 0 -0.5 2254 1.939261
4 15.01.2013 0 -0.5 2255 -0.072585
Machinery answered 14/2, 2023 at 10:10 Comment(1)
This is a great example of why we need new answers to old questions. – Quern

© 2022 - 2024 — McMap. All rights reserved.