Is it possible to add TransformedTargetRegressor into a scikit-learn pipeline?

I am setting up a predictive analytics pipeline on some data, and I am in the process of model selection. My target variable is skewed, so I would like to log-transform it in order to increase the performance of my linear regression estimators. I came across the relatively new TransformedTargetRegressor of scikit-learn, and I thought I could use it as part of a pipeline. I am attaching my code

My initial attempt was to transform y_train before calling gs.fit(), by casting it to np.log1p(y_train). This works, and I can perform the nested cross-validation and return the metrics of interest for all estimators. However, I would like to be able to get R^2 and RMSE for the trained model on previously unseen data (validation set), and I understand that in order to do that, I need to use (for example) r2_score function on y_val, preds, where the predictions need to have been transformed back to the real values, i.e., preds = np.expm1(gs.predict(X_val))

### Create a pipeline
pipe = Pipeline([
    # the transformer stage is populated by the param_grid
    ('transformer', TransformedTargetRegressor(func=np.log1p, inverse_func=np.expm1)),
    ('reg', DummyEstimator())  # Placeholder Estimator
])

### Candidate learning algorithms and their hyperparameters
alphas = [0.001, 0.01, 0.1, 1, 10, 100]
param_grid = [  
                {'transformer__regressor': Lasso(),
                 'reg': [Lasso()], # Actual Estimator
                 'reg__alpha': alphas},
                {'transformer__regressor': LassoLars(),
                 'reg': [LassoLars()], # Actual Estimator
                 'reg__alpha': alphas},
                {'transformer__regressor': Ridge(),
                 'reg': [Ridge()], # Actual Estimator
                 'reg__alpha': alphas},
                {'transformer__regressor': ElasticNet(),
                 'reg': [ElasticNet()], # Actual Estimator
                 'reg__alpha': alphas,
                 'reg__l1_ratio': [0.25, 0.5, 0.75]}]


### Create grid search (Inner CV)
gs = GridSearchCV(pipe, param_grid=param_grid, cv=5, verbose=2, n_jobs=-1,
                  scoring=scoring, refit='r2', return_train_score=True)


### Fit
best_model = gs.fit(X_train, y_train)

### scoring metrics for outer CV
scoring = ['neg_mean_absolute_error', 'r2', 'explained_variance', 'neg_mean_squared_error']

### Outer CV
linear_cv_results = cross_validate(gs, X_train, y_train_transformed,
                                   scoring=scoring, cv=5, verbose=3, return_train_score=True)

### Calculate mean metrics
train_r2 = (linear_cv_results['train_r2']).mean()
test_r2 = (linear_cv_results['test_r2']).mean()
train_mae = (-linear_cv_results['train_neg_mean_absolute_error']).mean()
test_mae = (-linear_cv_results['test_neg_mean_absolute_error']).mean()
train_exp_var = (linear_cv_results['train_explained_variance']).mean()
test_exp_var = (linear_cv_results['test_explained_variance']).mean()
train_rmse = (np.sqrt(-linear_cv_results['train_neg_mean_squared_error'])).mean()
test_rmse = (np.sqrt(-linear_cv_results['test_neg_mean_squared_error'])).mean()

Obviously this code snippet does not work, because apparently I can not add TransformedTargetRegressor into my pipeline, since it does not have a transform method (I get this TypeError: TypeError: All intermediate steps should be transformers and implement fit and transform).

Is there a "proper" way of doing this, or do I just have to take the log transformation of y_val on the fly when I want to call r2_score function etc?

No, because the scikit-learn original Pipeline does not change the y or the number of samples in X and y during the steps.

Your use-case is little unclear. What is the need of reg step if that same reg is already added to the TransformedTargetRegressor?

Looking at the documentation of TransformedTargetRegressor, the parameter regressor accepts a regressor (which can be also a pipeline which have some feature selection operations on X and a regressor at final stage). The working of TransformedTargetRegressor will be:

fit():

    regressor.fit(X, func(y))

predict():

    inverse_func(regressor.predict(X))

So there is no need to append that same regressor as a new step. Your model selection code now can be:

pipe = TransformedTargetRegressor(regressos = DummyEstimator(),
                                  func=np.log1p, 
                                  inverse_func=np.expm1)),

### Candidate learning algorithms and their hyperparameters
alphas = [0.001, 0.01, 0.1, 1, 10, 100]
param_grid = [  
                {'transformer__regressor': Lasso(),
                 'transformer__regressor__alpha': alphas},
                {'transformer__regressor': LassoLars(),
                 'transformer__regressor__alpha': alphas},
                {'transformer__regressor': Ridge(),
                 'transformer__regressor__alpha': alphas},
                {'transformer__regressor': ElasticNet(),
                 'transformer__regressor__alpha': alphas,
                 'transformer__regressor__l1_ratio': [0.25, 0.5, 0.75]}
              ]

Recommended topics

Hot tags