Consistent ColumnTransformer for intersecting lists of columns
I want to use sklearn.compose.ColumnTransformer sequentially (not in parallel: each transformer should be applied only after the previous one has finished) on intersecting lists of columns, like this:

import numpy as np
import pandas as pd
from sklearn import compose, impute, preprocessing as p

log_transformer = p.FunctionTransformer(lambda x: np.log(x))
df = pd.DataFrame({'a': [1, 2, np.nan, 4], 'b': [1, np.nan, 3, 4], 'c': [1, 2, 3, 4]})
compose.ColumnTransformer(n_jobs=1,
                          transformers=[
                              ('num', impute.SimpleImputer(), ['a', 'b']),
                              ('log', log_transformer, ['b', 'c']),
                              ('scale', p.StandardScaler(), ['a', 'b', 'c'])
                          ]).fit_transform(df)

So I want to apply SimpleImputer to 'a' and 'b', then the log transform to 'b' and 'c', and then StandardScaler to 'a', 'b' and 'c'.

But:

  1. I get an array of shape (4, 7).
  2. I still get NaN in the 'a' and 'b' columns.

So how can I use ColumnTransformer on different (overlapping) columns in the manner of a Pipeline?

UPD:

pipe_1 = pipeline.Pipeline(steps=[
    ('imp', impute.SimpleImputer(strategy='constant', fill_value=42)),
])

pipe_2 = pipeline.Pipeline(steps=[
    ('imp', impute.SimpleImputer(strategy='constant', fill_value=24)),
])

pipe_3 = pipeline.Pipeline(steps=[
    ('scl', p.StandardScaler()),
])

# in the real situation I don't know in advance which columns these lists will contain, so they are not static:
cols_1 = ['a']
cols_2 = ['b']
cols_3 = ['a', 'b', 'c']

proc = compose.ColumnTransformer(remainder='passthrough', transformers=[
    ('1', pipe_1, cols_1),
    ('2', pipe_2, cols_2),
    ('3', pipe_3, cols_3),
])
proc.fit_transform(df).T

Output:

array([[ 1.        ,  2.        , 42.        ,  4.        ],
       [ 1.        , 24.        ,  3.        ,  4.        ],
       [-1.06904497, -0.26726124,         nan,  1.33630621],
       [-1.33630621,         nan,  0.26726124,  1.06904497],
       [-1.34164079, -0.4472136 ,  0.4472136 ,  1.34164079]])

I understand why I get duplicated columns, NaNs and unscaled values, but how can I fix this correctly when the column lists are not static?

UPD2:

A problem may also arise when the columns change their order. So I want to use a FunctionTransformer for column selection:

def select_col(X, cols=None):
    return X[cols]

ct1 = compose.make_column_transformer(
    (p.OneHotEncoder(), p.FunctionTransformer(select_col, kw_args=dict(cols=['a', 'b']))),
    remainder='passthrough'
)

ct1.fit(df)

But get this output:

ValueError: No valid specification of the columns. Only a scalar, list or slice of all integers or all strings, or boolean mask is allowed

How can I fix it?

Calise answered 5/6, 2020 at 22:54 Comment(4)
In the update, I don't understand what you mean when you say "I don't know exactly what cols these arrays contain, so they are not static" – Hyozo
@BenReiniger these columns are created dynamically: e.g. I have a skewness test, so the col_1 array (for example) contains only the skewed columns, which should go to the log transformer. – Calise
The list of columns on which to apply each transformer can be given in different ways; if your skewness test can be encapsulated in a function, that can be used (see the docs, callable for columns). – Hyozo
Re: update 2: that's not what FunctionTransformer does. – Hyozo
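To make the "callable for columns" comment concrete, here is a minimal sketch (not from the original thread) of passing a function as the column specification, so the column list is computed at fit time instead of being static; the skewness threshold of 0.5 and the sample data are arbitrary illustrations:

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import FunctionTransformer

def skewed_cols(X):
    # called with the input data at fit time; returns the names of
    # columns whose sample skewness exceeds an arbitrary threshold
    return [c for c in X.columns if X[c].skew() > 0.5]

df = pd.DataFrame({'a': [1, 1, 1, 10.0],   # heavily right-skewed
                   'b': [1, 2, 3, 4.0]})   # symmetric

# the callable is evaluated on fit, so the selected columns adapt to the data
log_tfm = ColumnTransformer(
    transformers=[('log', FunctionTransformer(np.log), skewed_cols)],
    remainder='passthrough')

out = log_tfm.fit_transform(df)  # column 'a' is log-transformed, 'b' passes through
```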

The intended usage of ColumnTransformer is that the different transformers are applied in parallel, not sequentially. To accomplish your desired outcome, three approaches come to mind:

First approach:

pipe_a = Pipeline(steps=[('imp', SimpleImputer()),
                         ('scale', StandardScaler())])
pipe_b = Pipeline(steps=[('imp', SimpleImputer()),
                         ('log', log_transformer),
                         ('scale', StandardScaler())])
pipe_c = Pipeline(steps=[('log', log_transformer),
                         ('scale', StandardScaler())])
proc = ColumnTransformer(transformers=[
    ('a', pipe_a, ['a']),
    ('b', pipe_b, ['b']),
    ('c', pipe_c, ['c'])]
)
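For concreteness, this first approach can be run end to end roughly like so (a sketch using the df and log_transformer from the question; the shape check is this sketch's own addition):

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import FunctionTransformer, StandardScaler

log_transformer = FunctionTransformer(np.log)
df = pd.DataFrame({'a': [1, 2, np.nan, 4],
                   'b': [1, np.nan, 3, 4],
                   'c': [1, 2, 3, 4]})

# one pipeline per unique sequence of transformations, applied in parallel
pipe_a = Pipeline(steps=[('imp', SimpleImputer()),
                         ('scale', StandardScaler())])
pipe_b = Pipeline(steps=[('imp', SimpleImputer()),
                         ('log', log_transformer),
                         ('scale', StandardScaler())])
pipe_c = Pipeline(steps=[('log', log_transformer),
                         ('scale', StandardScaler())])

proc = ColumnTransformer(transformers=[
    ('a', pipe_a, ['a']),
    ('b', pipe_b, ['b']),
    ('c', pipe_c, ['c'])])

out = proc.fit_transform(df)
print(out.shape)  # (4, 3): one scaled output column per input column, no duplicates
```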

Second approach:
This requires sklearn >= 1.2 and the pandas-output functionality it introduced. Without it, the ColumnTransformers will rearrange the columns and forget their names, so the later transformers will fail or apply to the wrong columns. For earlier versions, you may be able to tweak it for your specific use case.

# note: log_transformer must be built with feature_names_out="one-to-one"
# so that it can propagate column names under pandas output
imp_tfm = ColumnTransformer(
    transformers=[('num', impute.SimpleImputer(), ['a', 'b'])],
    remainder='passthrough',
    verbose_feature_names_out=False,  # keep the original column names
    )
log_tfm = ColumnTransformer(
    transformers=[('log', log_transformer, ['b', 'c'])],
    remainder='passthrough',
    verbose_feature_names_out=False,
    )
scl_tfm = ColumnTransformer(
    transformers=[('scale', StandardScaler(), ['a', 'b', 'c'])],
    )
proc = Pipeline(steps=[
    ('imp', imp_tfm),
    ('log', log_tfm),
    ('scale', scl_tfm)]
).set_output(transform="pandas")

Third, there may be a way to use the Pipeline slicing feature to have one "master" pipeline that you cut down for each feature. This would work mostly like the first approach and might save some coding for larger pipelines, but it seems a little hacky. For example, here you can:

from sklearn.base import clone

pipe_c = clone(pipe_b)[1:]                  # drop the imputer: log + scale
pipe_a = clone(pipe_b)
pipe_a.steps[1] = ('nolog', 'passthrough')  # disable the log step: imp + scale

(Without cloning or otherwise deep-copying pipe_b, the last line would change both pipe_a and pipe_b. The slicing mechanism returns a copy, so pipe_c doesn't strictly need to be cloned, but I've left it in to feel safer. Unfortunately you can't take a discontinuous slice, so pipe_a = pipe_b[0, 2] doesn't work, but you can set individual steps to "passthrough" to disable them, as done above.)

Hyozo answered 6/6, 2020 at 15:49 Comment(8)
Thanks for your answer. I already thought about something like that. Unfortunately the first method is not very scalable or autonomous with respect to features: it would be very difficult to manage manually when there are many features. Also, can you describe in more detail your last sentence "...there may be a way to use the Pipeline slicing feature to have one "master" pipeline that you cut down for each feature..."? – Calise
Of course, in the first approach you don't need a separate pipe for every feature, just one for each unique list of transformations you want to apply. (E.g., send feature c also to pipe_b.) – Hyozo
Please check my updated question; I tried to explain what I meant. – Calise
I found out how we can do this; your second option is the starting point. More info. – Calise
@BenReiniger Following your first approach, can you please let me know what I should do if I had to apply pipe_a to columns a, b and pipe_b to b, c (some of the pipes have columns in common)? Thank you – Perigee
@Perigee The point of this "one pipeline per combination of preprocessing steps" approach is that there are no shared columns between pipes. The imputer works on columns a and b, but because b gets log-transformed while a doesn't, they get sent to different pipelines. – Hyozo
@BenReiniger Thanks, that makes sense. I was just asking what the approach should be if I had a use case with different transformations, some of which share columns. – Perigee
The second approach should now be workable, using the new pandas-output functionality in sklearn; you can specify the column names instead of their indices. – Hyozo

We can use a little columns_name_to_index hack to convert column names to indices, and then pass the dataframe to the pipeline like this:

def columns_name_to_index(arr_of_names, df):
    return [df.columns.get_loc(c) for c in arr_of_names if c in df]

cols_1 = ['a']
cols_2 = ['b']
cols_3 = ['a', 'b', 'c']

ct1 = compose.ColumnTransformer(remainder='passthrough', transformers=[
    ('imp1', impute.SimpleImputer(strategy='constant', fill_value=42), columns_name_to_index(cols_1, df)),
    ('imp2', impute.SimpleImputer(strategy='constant', fill_value=24), columns_name_to_index(cols_2, df)),
])

ct2 = compose.ColumnTransformer(remainder='passthrough', transformers=[
    ('scl', p.StandardScaler(), columns_name_to_index(cols_3, df)),
])

pipe = pipeline.Pipeline(steps=[
    ('ct1', ct1),
    ('ct2', ct2),
])

pipe.fit_transform(df).T
Calise answered 8/6, 2020 at 20:36 Comment(4)
I'm glad this works for your needs; I guess it's the "you may be able to tweak it for your specific usecase" part of my description of my second option. But a word of caution, to reiterate: "ColumnTransformer will rearrange the columns...", so if your ct2 didn't operate on the entire frame, and ct1 had its transformers in a different order (or included a transformer that added or dropped a column), this would fail, because columns_name_to_index refers to the indices in the original df. – Hyozo
@BenReiniger then I think e.g. OneHotEncoder can cause a problem. Can you suggest a fix? – Calise
In all my projects I only have a handful of different pipelines to apply, so my first option is the most applicable. If that's not the case for your data, maybe if you provide a more representative example we can hack something together. – Hyozo
@BenReiniger I tried to use FunctionTransformer but got an error; please check UPD2 in the question. – Calise

I could not comment on the answer by Ben Reiniger due to the low reputation of my account, so I added another answer.

To use Ben Reiniger's second approach, the verbose_feature_names_out parameter of each ColumnTransformer has to be set to False. Otherwise, the column names are changed at every transformer step and become unrecognizable to the next step. Additionally, sparse_output should be set to False wherever it defaults to True (as in OneHotEncoder).

For example (first check out the second approach in Ben Reiniger's post above):

At the input of the second transformer, the columns have been renamed to ["num__a", "num__b", "remainder__c"]. Since the second step expects columns ["b", "c"], it will raise a column-not-found error.

Cammie answered 29/5 at 11:49 Comment(0)
