Predicting on unseen data when using a Target-based Encoding technique

I am working on an automated ML (Regression) algorithm where the process flow is: user uploads data -- Data Cleaning -- Encoding (Target Encoder) -- Fitting the model -- results.

I am completely fine up to this point. My confusion is about what happens when the user wants to test this on unseen data that has no target variable: I need to again perform Data Cleaning -- Encoding, but the encoding technique I used while fitting the model only works if there is a target variable (the unseen data will not have one), and as far as I know I cannot switch to a different encoding technique for the unseen data, because the test data needs to go through the same procedure as the data used while fitting the model.

Could someone please help me find a way to overcome this issue? Any suggestions would be of great help.

Thanks in advance.

Freemason asked 9/11, 2020 at 21:34 Comment(0)

I know this is kinda late, but I think this is a useful question (though possibly it should be migrated over to Cross Validated), and I think none of the other answers correctly address the problem. I'll try to answer regardless.

I suppose that you need to use the encoding learned on the training set to encode the new observations in the test set.

So, with a silly example, suppose your encoded X feature is "type of fruit" and you "target encoded" the value "Banana" as "0.7" (the result of your target encoding). Then if a user inputs "Banana" to get a prediction, you encode it with the value "0.7" to make the model work.

This way the user gives you "Banana", you "translate" it to "0.7", pass that to the model, and receive the prediction for "0.7". Then you give that prediction back to the user.

But if "Banana" does not appear even in your training set (so it is truly "unseen"), then this is a different problem. My guess is that you could encode all the unseen levels of X with the simple mean of the target, and use that for the prediction.
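
A rough sketch of this idea in Python (the fruit values and targets here are invented purely for illustration):

import pandas as pd

# Invented training data.
train = pd.DataFrame({
    "fruit":  ["Banana", "Banana", "Apple", "Apple", "Cherry"],
    "target": [0.8, 0.6, 0.3, 0.5, 0.9],
})

# Learn the encoding on the training set: mean target per category,
# plus the global target mean as a fallback for truly unseen categories.
encoding = train.groupby("fruit")["target"].mean()   # e.g. Banana -> 0.7
global_mean = train["target"].mean()

def encode_fruit(value):
    # "Translate" a category into the number the model was trained on.
    return encoding.get(value, global_mean)

# At prediction time the user only supplies the feature, never the target.
print(encode_fruit("Banana"))   # 0.7
print(encode_fruit("Mango"))    # unseen level -> falls back to the global mean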

Like any other type of encoding, Target Encoding has its drawbacks; for reference I've found this post quite helpful: https://maxhalford.github.io/blog/target-encoding/.

Apprise answered 26/8, 2022 at 12:30 Comment(0)

For predictions on unseen data, you should simply omit the target encoding from your pipeline. You can therefore implement two versions of the pipeline.

This would be your training/testing/cross-validation pipeline:

User uploads training data -- Data Cleaning -- Encoding(Target Encoder) -- Training the model -- result

Note: use fit_transform on the encoder when running training data, and transform when running test or validation data, to avoid data leakage
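
A rough sketch of that note, assuming the category_encoders package provides the target encoder (the column names and values are invented):

import pandas as pd
import category_encoders as ce
from sklearn.linear_model import LinearRegression

# Invented training and test data with one categorical column.
X_train = pd.DataFrame({"fruit": ["Banana", "Apple", "Banana", "Cherry"]})
y_train = pd.Series([0.8, 0.3, 0.6, 0.9])
X_test = pd.DataFrame({"fruit": ["Apple", "Banana"]})   # no target column here

encoder = ce.TargetEncoder(cols=["fruit"])

# fit_transform on training data: the encoder learns the category -> number map from y_train.
X_train_enc = encoder.fit_transform(X_train, y_train)

# transform only on test/validation data: no target is needed or used, which avoids leakage.
X_test_enc = encoder.transform(X_test)

model = LinearRegression().fit(X_train_enc, y_train)
predictions = model.predict(X_test_enc)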

And this would be your prediction pipeline:

User uploads testing data -- Data Cleaning -- predictions using trained models

Dulsea answered 29/11, 2020 at 17:14 Comment(1)
Hi LazyEval, thanks for your answer, but when I fit the model using my encoded training data, all the data are numerical, and when I pass test data (which has two categorical columns) to the .pkl file, it throws an error: 'pkl has not seen the data before'. – Freemason

What you have outlined above is the training pipeline. In a test (inference) scenario, the pipeline would be slightly modified. Data upload and data cleaning should be performed in exactly the same way as in the training scenario, but as you acknowledge there is no need (or even possibility) to perform target encoding, since the target is what we are trying to predict during testing. In this case encoding is not performed, and the model is used to predict the target based on the cleaned data.

In short, the model pipeline should be nearly identical for train/test, with the exceptions that target encoding is not performed in the test scenario, and the final step will be a fit in the train scenario and a predict in the testing scenario.

Kamerad answered 29/11, 2020 at 16:58 Comment(2)
Hi Matt, thanks for your answer, but when I fit the model using my encoded training data, all the data are numerical, and when I pass test data (which has two categorical columns) to the .pkl file, it throws an error: 'pkl has not seen the data before'. – Freemason
This answer definitely understands the problem, but to me it is mostly reformulating it. I still wonder what @Freemason commented: how do we encode a category in the test set when we do not have access to the target? – Gamy

Your question is not so clear.

But let's assume two scenarios:

  1. you are encoding the input features with target info in some strange way

  2. you are simply encoding the regression values and, for example, you have set bucket values for the outcome:

something like

[0, 20] -> 1
[21, 40] -> 2
[41, 60] -> 3
[61, 80] -> 4
[81, 100] -> 5

Answer

  1. If you are encoding your features with the target value in some strange way, there is something wrong. You are basically introducing the info that you are trying to predict into the source, and that's a data leak. A model with this kind of set-up will not work on real data, because it's like cheating.

  2. If you are encoding your targets, which is quite common, there is nothing to change in your pipeline, because feature encoding and target encoding are two separate and independent steps.

Usually there will be two encoder functions, one for the features and one for the target, and these functions will be independent. In the training situation, you'll have a g(x) encoder for the features (with x being the input feature matrix) and a t(y) function for encoding the target (with y being the target values).

When you do the training, you need both encoded features and encoded labels to calculate errors and improve the model, so you'll do something like this:

model.fit(g(x_train), t(y_train)) # iterate: train data on g(x_train), calculate loss with t(y_train) and change the model accordingly 

When you do the prediction, you'll work with something like this:

y_test = model.predict(g(x_test)) # test with encoded unseen data 

Assuming the scenario in which you have used the example buckets above, y_test will already be encoded, with values like [1, 2, 3, 4, 5]. So there will be no need to use the encoding of your target output, because the target is the output of your prediction, and in no way should it be used as info, encoded in some way, in the training features.

The target should be used only in the loss function during training.
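
A rough sketch of this two-encoder setup (scikit-learn's OrdinalEncoder stands in for g here, and the bucket edges mirror the example ranges above; both are just illustrative choices):

import numpy as np
from sklearn.preprocessing import OrdinalEncoder
from sklearn.ensemble import RandomForestClassifier

# g(x): feature encoder, fitted only on the input features.
g = OrdinalEncoder(handle_unknown="use_encoded_value", unknown_value=-1)

# t(y): target encoder, here simple bucketing of the regression values
# into the classes 1..5 from the example above.
def t(y):
    return np.digitize(y, bins=[21, 41, 61, 81]) + 1

# Training: both encoders are used, but they are built independently of each other.
x_train = np.array([["Banana"], ["Apple"], ["Cherry"], ["Apple"]])
y_train = np.array([15, 55, 90, 30])
model = RandomForestClassifier().fit(g.fit_transform(x_train), t(y_train))

# Prediction on unseen data: only g(x) is needed, t(y) never appears.
x_test = np.array([["Banana"], ["Mango"]])
y_test = model.predict(g.transform(x_test))   # encoded values in 1..5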

So, in summary:

  1. Training could have two independent encoders:
  • g(x) for the x features
  • t(y) for the y labels/target

and these encoders are independent (i.e. they are constructed without knowing anything about each other)

  2. Testing on unseen data can have only the encoder for the features (already obtained from the training):
  • g(x) for the x features
Oxazine answered 29/11, 2020 at 17:39 Comment(2)
"A model with this kind of set-up will not work on real data, because it's like cheating"..mmm no it's not. Target Encoding it's quite used, for example see: docs.h2o.ai/h2o/latest-stable/h2o-docs/data-science/…Apprise
@Apprise thanks for the link, I didn't know this kind of target handling. Anyway, as I can see there is a data leakage, and it should be handled, which was the reason for my explanation :) docs.h2o.ai/h2o/latest-stable/h2o-docs/data-science/algo-params/… – Oxazine
