If a sklearn.preprocessing.LabelEncoder has been fitted on a training set, it might break if it encounters new values when used on a test set (transform raises a ValueError for labels it has not seen).
The only solution I could come up with is to map everything new in the test set (i.e. not belonging to any existing class) to "<unknown>", and then explicitly add a corresponding class to the LabelEncoder afterward:
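import numpy as np
from sklearn.preprocessing import LabelEncoder

# train and test are pandas DataFrames and c is the name of any column
le = LabelEncoder()
le.fit(train[c])
# replace every test value not seen during fitting with a placeholder
test[c] = test[c].map(lambda s: '<unknown>' if s not in le.classes_ else s)
# register the placeholder as an extra class
le.classes_ = np.append(le.classes_, '<unknown>')
train[c] = le.transform(train[c])
test[c] = le.transform(test[c])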
This works, but is there a better solution?
Update
As @sapo_cosmico points out in a comment, the above no longer seems to work, which I assume is due to an implementation change in LabelEncoder.transform: it now seems to use np.searchsorted (I don't know whether that was the case before). So instead of appending the <unknown> class to the LabelEncoder's list of already extracted classes, it needs to be inserted in sorted order:
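import bisect
import numpy as np

le_classes = le.classes_.tolist()
# insert the placeholder at its sorted position instead of appending it
bisect.insort_left(le_classes, '<unknown>')
# convert back to an array, since classes_ is normally a NumPy array
le.classes_ = np.array(le_classes)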
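To show why the order seems to matter, here is a minimal stdlib-only sketch. It assumes that transform does a binary search over classes_ (my guess from the np.searchsorted observation above, not something I have confirmed in the sklearn source). The lookup helper and the fruit labels are hypothetical and only stand in for the encoder.

```python
import bisect

def lookup(sorted_classes, label):
    """Binary-search lookup; a hypothetical stand-in for a searchsorted-based transform."""
    i = bisect.bisect_left(sorted_classes, label)
    # Binary search only finds the label if the list is actually sorted
    if i < len(sorted_classes) and sorted_classes[i] == label:
        return i
    return -1  # label not found at the probed position

# Hypothetical fitted classes, already in sorted order
classes = ["apple", "banana", "cherry"]

# Appending breaks the ordering: "<" sorts before lowercase letters
appended = classes + ["<unknown>"]

# Inserting at the sorted position keeps the ordering intact
inserted = list(classes)
bisect.insort_left(inserted, "<unknown>")

print(lookup(appended, "<unknown>"))  # -1: the appended class is never found
print(lookup(inserted, "<unknown>"))  # 0: found at the front of the list
```

With the appended list the search probes the front of the list, sees "apple" and gives up, which would match the failure described in the comment. With the sorted insert the placeholder is found, and the existing labels still resolve (shifted by one position).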
However, as this feels pretty clunky all in all, I'm certain there is a better approach for this.