I'm sorry if the title of the question isn't clear; I couldn't sum up the problem in one line.
Here are simplified datasets to explain the problem. The training set contains many more categories than the test set, so after OneHotEncoding the training and test sets end up with different numbers of columns. How can I handle this?
Training Set
+-------+----------+
| Value | Category |
+-------+----------+
| 100 | SE1 |
+-------+----------+
| 200 | SE2 |
+-------+----------+
| 300 | SE3 |
+-------+----------+
Training set after OneHotEncoding
+-------+-----------+-----------+-----------+
| Value | DummyCat1 | DummyCat2 | DummyCat3 |
+-------+-----------+-----------+-----------+
| 100 | 1 | 0 | 0 |
+-------+-----------+-----------+-----------+
| 200 | 0 | 1 | 0 |
+-------+-----------+-----------+-----------+
| 300 | 0 | 0 | 1 |
+-------+-----------+-----------+-----------+
Test Set
+-------+----------+
| Value | Category |
+-------+----------+
| 100 | SE1 |
+-------+----------+
| 200 | SE1 |
+-------+----------+
| 300 | SE2 |
+-------+----------+
Test set after OneHotEncoding
+-------+-----------+-----------+
| Value | DummyCat1 | DummyCat2 |
+-------+-----------+-----------+
| 100 | 1 | 0 |
+-------+-----------+-----------+
| 200 | 1 | 0 |
+-------+-----------+-----------+
| 300 | 0 | 1 |
+-------+-----------+-----------+
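The mismatch in the tables above can be reproduced with a short sketch. This assumes the dummy columns were created with pandas.get_dummies (the actual encoding code is not shown in the question, so treat this as a hypothetical reconstruction):

```python
import pandas as pd

# Simplified training and test sets from the question
train = pd.DataFrame({"Value": [100, 200, 300],
                      "Category": ["SE1", "SE2", "SE3"]})
test = pd.DataFrame({"Value": [100, 200, 300],
                     "Category": ["SE1", "SE1", "SE2"]})

# Encoding each set independently creates one dummy column
# per category *present in that set*, so the shapes diverge.
x_train = pd.get_dummies(train, columns=["Category"])
x_test = pd.get_dummies(test, columns=["Category"])

print(x_train.shape)  # (3, 4): Value + 3 dummy columns
print(x_test.shape)   # (3, 3): Value + only 2 dummy columns
```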
As you can see, the training set after OneHotEncoding has shape (3, 4), while the test set after OneHotEncoding has shape (3, 3). Because of this, when I run the following code (y_train is a vector of shape (3,)):
from sklearn.linear_model import LinearRegression
regressor = LinearRegression()
regressor.fit(x_train, y_train)
x_pred = regressor.predict(x_test)
I get the following error at the predict call. The dimensions in the traceback are large because they come from my full dataset, not the simplified example above.
Traceback (most recent call last):
File "<ipython-input-2-5bac76b24742>", line 30, in <module>
x_pred = regressor.predict(x_test)
File "/Users/parthapratimneog/anaconda3/lib/python3.6/site-packages/sklearn/linear_model/base.py", line 256, in predict
return self._decision_function(X)
File "/Users/parthapratimneog/anaconda3/lib/python3.6/site-packages/sklearn/linear_model/base.py", line 241, in _decision_function
dense_output=True) + self.intercept_
File "/Users/parthapratimneog/anaconda3/lib/python3.6/site-packages/sklearn/utils/extmath.py", line 140, in safe_sparse_dot
return np.dot(a, b)
ValueError: shapes (4801,2236) and (4033,) not aligned: 2236 (dim 1) != 4033 (dim 0)
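One way to make the shapes agree, assuming both frames were encoded separately with pandas.get_dummies, is to align the test frame's columns to the training frame's columns before predicting. This is a sketch, not necessarily the best fix:

```python
import pandas as pd

x_train = pd.get_dummies(
    pd.DataFrame({"Value": [100, 200, 300],
                  "Category": ["SE1", "SE2", "SE3"]}),
    columns=["Category"])
x_test = pd.get_dummies(
    pd.DataFrame({"Value": [100, 200, 300],
                  "Category": ["SE1", "SE1", "SE2"]}),
    columns=["Category"])

# Add the dummy columns missing from the test set (filled with zeros)
# and drop any test-only columns, so both frames share one column set
# in the same order.
x_test = x_test.reindex(columns=x_train.columns, fill_value=0)

print(x_test.shape)  # (3, 4), now matching x_train
```

After this, x_test can be passed to a model fitted on x_train without the dimension mismatch.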
Comments:

…x_train and x_test. If the factors in your training data have more levels, either represent them explicitly in your test data or drop the ones from x_train that cannot be applied to your test set. – Pax

…x_train? – Tuber

x_train.drop(x_train.columns[~x_train.columns.isin(x_test.columns)], 1) should work. – Pax

Use transform(x_test). I'm assuming you are currently using fit_transform() on test data, but fit() or fit_transform() should only be used with train data, and only transform() on test data. If you share the code by which you encode x_train and x_test, I can add an answer to help you. – Labors
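The fit-on-train, transform-on-test approach suggested in the comments can be sketched with scikit-learn's OneHotEncoder. The target values y_train here are hypothetical, since the question does not show them:

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder
from sklearn.linear_model import LinearRegression

# Simplified data from the question; y_train values are made up
train_cat = np.array([["SE1"], ["SE2"], ["SE3"]])
test_cat = np.array([["SE1"], ["SE1"], ["SE2"]])
train_val = np.array([[100.0], [200.0], [300.0]])
test_val = np.array([[100.0], [200.0], [300.0]])
y_train = np.array([1.0, 2.0, 3.0])

# Fit the encoder ONLY on the training categories; transform() then
# emits the same columns for the test set, with zeros for absent
# categories. handle_unknown="ignore" also covers categories that
# appear only in the test set.
enc = OneHotEncoder(handle_unknown="ignore")
x_train = np.hstack([train_val, enc.fit_transform(train_cat).toarray()])
x_test = np.hstack([test_val, enc.transform(test_cat).toarray()])

regressor = LinearRegression()
regressor.fit(x_train, y_train)
print(regressor.predict(x_test).shape)  # (3,) -- shapes now align
```

Because the encoder is fitted once and reused, x_train and x_test both come out with shape (3, 4), so the predict call no longer raises the dimension mismatch.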