Consistent ColumnTransformer for intersecting lists of columns
I want to use sklearn.compose.ColumnTransformer sequentially (not in parallel: each transformer should be applied only after the previous one has finished) on intersecting lists of columns, like this:

import numpy as np
import pandas as pd
from sklearn import compose, impute, preprocessing as p

log_transformer = p.FunctionTransformer(lambda x: np.log(x))
df = pd.DataFrame({'a': [1, 2, np.nan, 4], 'b': [1, np.nan, 3, 4], 'c': [1, 2, 3, 4]})
compose.ColumnTransformer(n_jobs=1,
                          transformers=[
                              ('num', impute.SimpleImputer(), ['a', 'b']),
                              ('log', log_transformer, ['b', 'c']),
                              ('scale', p.StandardScaler(), ['a', 'b', 'c'])
                          ]).fit_transform(df)

So I want to apply SimpleImputer to 'a' and 'b', then the log transform to 'b' and 'c', and then StandardScaler to 'a', 'b' and 'c'.

But:

  1. I get an array of shape (4, 7).
  2. I still get NaN in the 'a' and 'b' columns.

So how can I use ColumnTransformer on different (overlapping) columns in the manner of a Pipeline?

UPD:

pipe_1 = pipeline.Pipeline(steps=[
    ('imp', impute.SimpleImputer(strategy='constant', fill_value=42)),
])

pipe_2 = pipeline.Pipeline(steps=[
    ('imp', impute.SimpleImputer(strategy='constant', fill_value=24)),
])

pipe_3 = pipeline.Pipeline(steps=[
    ('scl', p.StandardScaler()),
])

# in the real situation I don't know in advance which columns these lists will contain, so they are not static:
cols_1 = ['a']
cols_2 = ['b']
cols_3 = ['a', 'b', 'c']

proc = compose.ColumnTransformer(remainder='passthrough', transformers=[
    ('1', pipe_1, cols_1),
    ('2', pipe_2, cols_2),
    ('3', pipe_3, cols_3),
])
proc.fit_transform(df).T

Output:

array([[ 1.        ,  2.        , 42.        ,  4.        ],
       [ 1.        , 24.        ,  3.        ,  4.        ],
       [-1.06904497, -0.26726124,         nan,  1.33630621],
       [-1.33630621,         nan,  0.26726124,  1.06904497],
       [-1.34164079, -0.4472136 ,  0.4472136 ,  1.34164079]])

I understand why I get duplicated columns, NaNs and unscaled values, but how can I fix this correctly when the column lists are not static?

UPD2:

A problem may also arise when the columns change their order. So I want to use a FunctionTransformer for column selection:

def select_col(X, cols=None):
    return X[cols]

ct1 = compose.make_column_transformer(
    (p.OneHotEncoder(), p.FunctionTransformer(select_col, kw_args=dict(cols=['a', 'b']))),
    remainder='passthrough'
)

ct1.fit(df)

But get this output:

ValueError: No valid specification of the columns. Only a scalar, list or slice of all integers or all strings, or boolean mask is allowed

How can I fix it?

Calise answered 5/6, 2020 at 22:54 Comment(4)
In the update, I don't understand what you mean when you say "I don't know exactly what cols these arrays contain, so they are not static" – Hyozo
@BenReiniger these columns are created dynamically: e.g. I have a skewness test, so the col_1 array (for example) contains only the skewed columns, which should go to the log transformer. – Calise
The list of columns on which to apply each transformer can be given in different ways; if your skewness test can be encapsulated in a function, that can be used (see the docs, callable for columns). – Hyozo
Re: update 2: that's not what FunctionTransformer does. – Hyozo
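To make the "callable for columns" comment concrete, here is a minimal sketch (not from the original thread) of passing a function as the column specification, so the column list is computed at fit time instead of being static; the skewness threshold of 0.5 and the sample data are arbitrary illustrations:

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import FunctionTransformer

def skewed_cols(X):
    # called with the input data at fit time; returns the names of
    # columns whose sample skewness exceeds an arbitrary threshold
    return [c for c in X.columns if X[c].skew() > 0.5]

df = pd.DataFrame({'a': [1, 1, 1, 10.0],   # heavily right-skewed
                   'b': [1, 2, 3, 4.0]})   # symmetric

# the callable is evaluated on fit, so the selected columns adapt to the data
log_tfm = ColumnTransformer(
    transformers=[('log', FunctionTransformer(np.log), skewed_cols)],
    remainder='passthrough')

out = log_tfm.fit_transform(df)  # column 'a' is log-transformed, 'b' passes through
```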

The intended usage of ColumnTransformer is that the different transformers are applied in parallel, not sequentially. To accomplish your desired outcome, three approaches come to mind:

First approach:

pipe_a = Pipeline(steps=[('imp', SimpleImputer()),
                         ('scale', StandardScaler())])
pipe_b = Pipeline(steps=[('imp', SimpleImputer()),
                         ('log', log_transformer),
                         ('scale', StandardScaler())])
pipe_c = Pipeline(steps=[('log', log_transformer),
                         ('scale', StandardScaler())])
proc = ColumnTransformer(transformers=[
    ('a', pipe_a, ['a']),
    ('b', pipe_b, ['b']),
    ('c', pipe_c, ['c'])]
)
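For concreteness, this first approach can be run end to end roughly like so (a sketch using the df and log_transformer from the question; the shape check is this sketch's own addition):

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import FunctionTransformer, StandardScaler

log_transformer = FunctionTransformer(np.log)
df = pd.DataFrame({'a': [1, 2, np.nan, 4],
                   'b': [1, np.nan, 3, 4],
                   'c': [1, 2, 3, 4]})

# one pipeline per unique sequence of transformations, applied in parallel
pipe_a = Pipeline(steps=[('imp', SimpleImputer()),
                         ('scale', StandardScaler())])
pipe_b = Pipeline(steps=[('imp', SimpleImputer()),
                         ('log', log_transformer),
                         ('scale', StandardScaler())])
pipe_c = Pipeline(steps=[('log', log_transformer),
                         ('scale', StandardScaler())])

proc = ColumnTransformer(transformers=[
    ('a', pipe_a, ['a']),
    ('b', pipe_b, ['b']),
    ('c', pipe_c, ['c'])])

out = proc.fit_transform(df)
print(out.shape)  # (4, 3): one scaled output column per input column, no duplicates
```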

Second approach:
This requires sklearn >= 1.2 and the pandas-output functionality it introduced. Without it, the ColumnTransformers will rearrange the columns and forget their names, so the later transformers will fail or apply to the wrong columns. For earlier versions, you may be able to tweak it for your specific use case.

# note: log_transformer must be built with feature_names_out="one-to-one"
# so that it can propagate column names under pandas output
imp_tfm = ColumnTransformer(
    transformers=[('num', impute.SimpleImputer(), ['a', 'b'])],
    remainder='passthrough',
    verbose_feature_names_out=False,  # keep the original column names
    )
log_tfm = ColumnTransformer(
    transformers=[('log', log_transformer, ['b', 'c'])],
    remainder='passthrough',
    verbose_feature_names_out=False,
    )
scl_tfm = ColumnTransformer(
    transformers=[('scale', StandardScaler(), ['a', 'b', 'c'])],
    )
proc = Pipeline(steps=[
    ('imp', imp_tfm),
    ('log', log_tfm),
    ('scale', scl_tfm)]
).set_output(transform="pandas")

Third, there may be a way to use the Pipeline slicing feature to have one "master" pipeline that you cut down for each feature. This would work mostly like the first approach and might save some coding for larger pipelines, but it seems a little hacky. For example, here you can:

from sklearn.base import clone

pipe_c = clone(pipe_b)[1:]                  # drop the imputer: log + scale
pipe_a = clone(pipe_b)
pipe_a.steps[1] = ('nolog', 'passthrough')  # disable the log step: imp + scale

(Without cloning or otherwise deep-copying pipe_b, the last line would change both pipe_a and pipe_b. The slicing mechanism returns a copy, so pipe_c doesn't strictly need to be cloned, but I've left it in to feel safer. Unfortunately you can't take a discontinuous slice, so pipe_a = pipe_b[0, 2] doesn't work, but you can set individual steps to "passthrough" to disable them, as done above.)

Hyozo answered 6/6, 2020 at 15:49 Comment(8)
Thanks for your answer. I already thought about something like that. Unfortunately the first method is not very scalable or autonomous with respect to features: it would be very difficult to manage manually when there are many features. Also, can you describe in more detail your last sentence "...there may be a way to use the Pipeline slicing feature to have one "master" pipeline that you cut down for each feature..."? – Calise
Of course, in the first approach you don't need a separate pipe for every feature, just one for each unique list of transformations you want to apply. (E.g., send feature c also to pipe_b.) – Hyozo
Please check my updated question; I tried to explain what I meant. – Calise
I found out how we can do this; your second option is the starting point. More info. – Calise
@BenReiniger Following your first approach, can you please let me know what I should do if I had to apply pipe_a to columns a, b and pipe_b to b, c (some of the pipes have columns in common)? Thank you – Perigee
@Perigee The point of this "one pipeline per combination of preprocessing steps" approach is that there are no shared columns between pipes. The imputer works on columns a and b, but because b gets log-transformed while a doesn't, they get sent to different pipelines. – Hyozo
@BenReiniger Thanks, that makes sense. I was just asking what the approach should be if I had a use case with different transformations, some of which share columns. – Perigee
The second approach should now be workable, using the new pandas-output functionality in sklearn; you can specify the column names instead of their indices. – Hyozo

We can use a little columns_name_to_index hack to convert column names to indices, and then pass the dataframe to the pipeline like this:

def columns_name_to_index(arr_of_names, df):
    return [df.columns.get_loc(c) for c in arr_of_names if c in df]

cols_1 = ['a']
cols_2 = ['b']
cols_3 = ['a', 'b', 'c']

ct1 = compose.ColumnTransformer(remainder='passthrough', transformers=[
    ('imp1', impute.SimpleImputer(strategy='constant', fill_value=42), columns_name_to_index(cols_1, df)),
    ('imp2', impute.SimpleImputer(strategy='constant', fill_value=24), columns_name_to_index(cols_2, df)),
])

ct2 = compose.ColumnTransformer(remainder='passthrough', transformers=[
    ('scl', p.StandardScaler(), columns_name_to_index(cols_3, df)),
])

pipe = pipeline.Pipeline(steps=[
    ('ct1', ct1),
    ('ct2', ct2),
])

pipe.fit_transform(df).T
Calise answered 8/6, 2020 at 20:36 Comment(4)
I'm glad this works for your needs; I guess it's the "you may be able to tweak it for your specific usecase" part of my description of my second option. But a word of caution, to reiterate: "ColumnTransformer will rearrange the columns...", so if your ct2 didn't operate on the entire frame, and ct1 had its transformers in a different order (or included a transformer that added or dropped a column), this would fail, because columns_name_to_index refers to the indices in the original df. – Hyozo
@BenReiniger then I think e.g. OneHotEncoder can cause a problem. Can you suggest a fix? – Calise
In all my projects I only have a handful of different pipelines to apply, so my first option is the most applicable. If that's not the case for your data, maybe if you provide a more representative example we can hack something together. – Hyozo
@BenReiniger I tried to use FunctionTransformer but got an error; please check UPD2 in the question. – Calise

I could not comment on the answer by Ben Reiniger due to the low reputation of my account, so I added another answer.

To use Ben Reiniger's second approach, the verbose_feature_names_out parameter of each ColumnTransformer has to be set to False. Otherwise, the column names are changed at every transformer step and become unrecognizable to the next step. Additionally, sparse_output should be set to False wherever it defaults to True (as in OneHotEncoder).

For example (first check out the second approach in Ben Reiniger's post above):

At the input of the second transformer, the columns have been renamed to ["num__a", "num__b", "remainder__c"]. Since the second step expects columns ["b", "c"], it will raise a column-not-found error.

Cammie answered 29/5 at 11:49 Comment(0)
