How to handle category mismatch after onehotencoding from test data while predicting?

I'm sorry if the title of the question is not that clear, I could not sum up the problem in one line.

Here are the simplified datasets for an explanation. Basically, the number of categories in the training set is much larger than the categories in the test set, because of which there is a difference in the number of columns in the test and training set after OneHotEncoding. How can I handle this problem?

Training Set

+-------+----------+
| Value | Category |
+-------+----------+
| 100   | SE1      |
+-------+----------+
| 200   | SE2      |
+-------+----------+
| 300   | SE3      |
+-------+----------+

Training set after OneHotEncoding

+-------+-----------+-----------+-----------+
| Value | DummyCat1 | DummyCat2 | DummyCat3 |
+-------+-----------+-----------+-----------+
| 100   | 1         | 0         | 0         |
+-------+-----------+-----------+-----------+
| 200   | 0         | 1         | 0         |
+-------+-----------+-----------+-----------+
| 300   | 0         | 0         | 1         |
+-------+-----------+-----------+-----------+

Test Set

+-------+----------+
| Value | Category |
+-------+----------+
| 100   | SE1      |
+-------+----------+
| 200   | SE1      |
+-------+----------+
| 300   | SE2      |
+-------+----------+

Test set after OneHotEncoding

+-------+-----------+-----------+
| Value | DummyCat1 | DummyCat2 |
+-------+-----------+-----------+
| 100   | 1         | 0         |
+-------+-----------+-----------+
| 200   | 1         | 0         |
+-------+-----------+-----------+
| 300   | 0         | 1         |
+-------+-----------+-----------+

As you can notice, the training set after the OneHotEncoding is of shape (3,4) while the test set after OneHotEncoding is of shape (3,3). Because of this, when I do the following code (y_train is a column vector of shape (3,))

from sklearn.linear_model import LinearRegression
regressor = LinearRegression()
regressor.fit(x_train, y_train)

x_pred = regressor.predict(x_test)

I get the error at the predict function. As you can see, the dimensions in the error are quite large, unlike the basic examples.

  Traceback (most recent call last):

  File "<ipython-input-2-5bac76b24742>", line 30, in <module>
    x_pred = regressor.predict(x_test)

  File "/Users/parthapratimneog/anaconda3/lib/python3.6/site-packages/sklearn/linear_model/base.py", line 256, in predict
    return self._decision_function(X)

  File "/Users/parthapratimneog/anaconda3/lib/python3.6/site-packages/sklearn/linear_model/base.py", line 241, in _decision_function
    dense_output=True) + self.intercept_

  File "/Users/parthapratimneog/anaconda3/lib/python3.6/site-packages/sklearn/utils/extmath.py", line 140, in safe_sparse_dot
    return np.dot(a, b)

ValueError: shapes (4801,2236) and (4033,) not aligned: 2236 (dim 1) != 4033 (dim 0)
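The mismatch can be reproduced in isolation with the toy tables above. This is a minimal sketch (the category strings are the illustrative ones from the question, and it assumes a scikit-learn version that accepts string categories directly):

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Training set has three distinct categories, test set only two.
cat_train = np.array([["SE1"], ["SE2"], ["SE3"]])
cat_test = np.array([["SE1"], ["SE1"], ["SE2"]])

# Calling fit_transform on BOTH sets fits two independent encoders,
# so the resulting matrices have different widths.
x_train = OneHotEncoder().fit_transform(cat_train).toarray()
x_test = OneHotEncoder().fit_transform(cat_test).toarray()

print(x_train.shape)  # (3, 3)
print(x_test.shape)   # (3, 2)
```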
Tuber answered 13/12, 2017 at 6:11 Comment(10)
You need the same set of features in x_train and x_test. If the factors in your training data have more levels, either represent them explicitly in your test data or drop the ones from x_train that cannot be applied to your test set.Pax
@Pax Is there any straightforward way to remove the extra levels from x_train?Tuber
If you dummify before splitting into train and test, you may avoid this issue. But if you can't do that for some reason, something like x_train.drop(x_train.columns[~x_train.columns.isin(x_test.columns)], axis=1) should work.Pax
I am not splitting the data, they already come as different CSVs. Lemme try out the code piece.Tuber
@Pax It's giving some issues. Can I mail you the dataset and my sample code? It's very basic. Would be great if you can take the time to check it out.Tuber
Please either update your question with more detail or post a question more specific to the problem you're encountering.Pax
How are you doing the one-hot encoding? Using pd.get_dummies?? Or scikit OneHotEncoder?Labors
@VivekKumar I am doing OneHotEncoder of scikit-learnTuber
In that case, just use the same object with which you transformed the train data and call transform(x_test). I'm assuming you are currently using fit_transform() on the test data, but fit() or fit_transform() should only be used on train data, and only transform() on test data. If you share the code by which you encode x_train and x_test, I can add an answer to help you.Labors
@VivekKumar This worked for me. Can you put up a detailed answer why it worked so that it will be helpful for everyone? I'll accept it.Tuber

You have to transform the x_test the same way in which x_train was transformed.

x_test = onehotencoder.transform(x_test)
x_pred = regressor.predict(x_test)

Make sure to use the same onehotencoder object that was used to fit() on x_train.

I'm assuming that you are currently using fit_transform() on the test data. Calling fit() or fit_transform() discards the previously learnt categories and re-fits the OneHotEncoder. It will then think that there are only two distinct values present in the column, and hence will change the shape of the output.
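As an illustration, here is a minimal sketch with the toy categories from the question (it assumes a scikit-learn version that accepts string categories directly):

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

cat_train = np.array([["SE1"], ["SE2"], ["SE3"]])
cat_test = np.array([["SE1"], ["SE1"], ["SE2"]])

onehotencoder = OneHotEncoder()
# fit_transform learns all three categories from the training data ...
x_train = onehotencoder.fit_transform(cat_train).toarray()
# ... and transform reuses them, so the test matrix gets the same width.
x_test = onehotencoder.transform(cat_test).toarray()

print(x_train.shape)  # (3, 3)
print(x_test.shape)   # (3, 3)
```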

Labors answered 13/12, 2017 at 15:14 Comment(5)
ValueError: unknown categorical feature present [462 61 462 ..., 61 61 462] during transform. How do you handle this, though? This was on another dataset; I think the test set now contains categories that are missing from the training set.Tuber
@ParthapratimNeog Yes, that's the case. Here you need to think about what you will do in a real-world setting where new, unseen data contains categories you have not learnt about. Do you drop those cases, or retrain on the whole data?Labors
Ideally I would want to train on them again, but they are not in the training data, so how do I do that? I have tried ignoring them for now. Sorry, I'm really new to this :)Tuber
@ParthapratimNeog You need to include them in the training data and train again. For now, you can drop the rows that contain such categories. Also, just for the dummy values, you can use the OneHotEncoder on the whole dataset and then split it into train and test; it wouldn't hurt performance or overfit.Labors
@VivekKumar: the problem with using the OneHotEncoder on the whole dataset and then splitting into train and test is that the test data may contain values that were not in the training data. It is still better to transform the training and test data separately.Photograph

There are two cases:

i) a train data feature/column has more categories than the corresponding test column

ii) a test data feature/column has more categories than the corresponding train column

In both cases, the test data should only be transformed with the already-fitted encoder, never fit-and-transformed.

The general pattern of OneHotEncoder usage:

onehotencoder = OneHotEncoder()
enc_data_train = pd.DataFrame(onehotencoder.fit_transform(X_train[cat_columns]).toarray(), index=X_train.index)
X_train = X_train[num_columns].join(enc_data_train)

enc_data_test = pd.DataFrame(onehotencoder.transform(X_test[cat_columns]).toarray(), index=X_test.index)
X_test = X_test[num_columns].join(enc_data_test)

Here cat_columns are the categorical columns and num_columns are the numerical columns. You never fit on X_test. The following code is wrong:

X_test = onehotencoder.fit_transform(X_test[cat_columns]).toarray()

This is not how test data should be encoded.

Now to the problem of mismatch: train and test can have different numbers of categories, and therefore different numbers of columns after encoding.

Two ways to solve it:

i) fit using the entire data (train and test), then only transform X_train and X_test separately

ii) ignore categories in the test set that were not seen during training

i) example code

onehotencoder = OneHotEncoder()
onehotencoder.fit(X[cat_columns])

X_train, X_test, Y_train, Y_test = train_test_split(X, Y, random_state=0, train_size=0.75)

enc_data_train = pd.DataFrame(onehotencoder.transform(X_train[cat_columns]).toarray(), index=X_train.index)
X_train = X_train[num_columns].join(enc_data_train)

enc_data_test = pd.DataFrame(onehotencoder.transform(X_test[cat_columns]).toarray(), index=X_test.index)
X_test = X_test[num_columns].join(enc_data_test)

ii) using handle_unknown='ignore', fitting and transforming with the train set only

onehotencoder = OneHotEncoder(handle_unknown='ignore')

enc_data_train = pd.DataFrame(onehotencoder.fit_transform(X_train[cat_columns]).toarray(), index=X_train.index)
X_train = X_train[num_columns].join(enc_data_train)

enc_data_test = pd.DataFrame(onehotencoder.transform(X_test[cat_columns]).toarray(), index=X_test.index)
X_test = X_test[num_columns].join(enc_data_test)

This second way ignores categories that appear only in the test set: their rows get all-zero dummy columns, so train and test end up with the same number of columns. It assumes the unseen categories are not significant. Even if one of them is significant, the model cannot account for it, since it was never present in the training set, so its impact is not reflected in the model.
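A minimal sketch of approach ii), where SE4 is a made-up category that never appears in training (assumes a scikit-learn version that accepts string categories directly):

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

cat_train = np.array([["SE1"], ["SE2"], ["SE3"]])
# "SE4" is unseen during training.
cat_test = np.array([["SE1"], ["SE4"]])

enc = OneHotEncoder(handle_unknown="ignore")
x_train = enc.fit_transform(cat_train).toarray()
x_test = enc.transform(cat_test).toarray()

# Same width as training; the unseen category becomes an all-zero row.
print(x_test.shape)  # (2, 3)
print(x_test[1])     # [0. 0. 0.]
```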

Jaquelynjaquenetta answered 11/9, 2022 at 17:52 Comment(0)