sklearn.LabelEncoder with never-seen-before values

If a sklearn.LabelEncoder has been fitted on a training set, it might break if it encounters new values when used on a test set.
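
For instance, a minimal demonstration of the failure (the exact error message varies between versions):

from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()
le.fit(['a', 'b', 'c'])
le.transform(['a', 'd'])  # ValueError: y contains previously unseen labels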

The only solution I could come up with for this is to map everything new in the test set (i.e. not belonging to any existing class) to "<unknown>", and then explicitly add a corresponding class to the LabelEncoder afterward:

# train and test are pandas DataFrames and c is whatever categorical column
import numpy as np
from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()
le.fit(train[c])
# map anything not seen during fit to '<unknown>'
test[c] = test[c].map(lambda s: '<unknown>' if s not in le.classes_ else s)
le.classes_ = np.append(le.classes_, '<unknown>')
train[c] = le.transform(train[c])
test[c] = le.transform(test[c])

This works, but is there a better solution?

Update

As @sapo_cosmico points out in a comment, it seems that the above doesn't work anymore, given what I assume is an implementation change in LabelEncoder.transform, which now seems to use np.searchsorted (I don't know if it was the case before). So instead of appending the <unknown> class to the LabelEncoder's list of already extracted classes, it needs to be inserted in sorted order:

import bisect
import numpy as np

le_classes = le.classes_.tolist()
bisect.insort_left(le_classes, '<unknown>')
le.classes_ = np.asarray(le_classes)  # keep classes_ as an ndarray, as sklearn expects

However, as this feels pretty clunky all in all, I'm certain there is a better approach for this.
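
For reference, here is the whole workaround rolled into two small helpers (a sketch assuming string-valued pandas Series; the helper names are just illustrative):

import bisect

import numpy as np
from sklearn.preprocessing import LabelEncoder


def fit_encoder_with_unknown(train_col):
    # fit on the training column, then insert '<unknown>' into classes_
    # in sorted order, since transform relies on np.searchsorted
    le = LabelEncoder()
    le.fit(train_col)
    classes = le.classes_.tolist()
    bisect.insort_left(classes, '<unknown>')
    le.classes_ = np.asarray(classes)
    return le


def encode_with_unknown(le, col):
    # map anything the encoder has never seen to '<unknown>' before encoding
    known = set(le.classes_)
    return le.transform(col.map(lambda s: s if s in known else '<unknown>'))


le = fit_encoder_with_unknown(train[c])
train[c] = encode_with_unknown(le, train[c])
test[c] = encode_with_unknown(le, test[c])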

Algia answered 11/1, 2014 at 1:54 Comment(2)
https://mcmap.net/q/205523/-getting-valueerror-y-contains-new-labels-when-using-scikit-learn-39-s-labelencoder – Malmo
The majority of the highly rated answers are outdated; @Algia see my answer, as of version 0.24 this use case is natively supported. – Suicide
54

As of scikit-learn 0.24.0, you shouldn't use LabelEncoder on your features at all; use OrdinalEncoder instead. As its name suggests, LabelEncoder is meant for labels.

Since models will never predict a label that wasn't seen in their training data, LabelEncoder should never support an unknown label.

For features, though, it's different, as you may obviously encounter categories never seen in the training set. In version 0.24.0 scikit-learn introduced two new arguments to OrdinalEncoder that allow it to encode unknown categories.

An example of using OrdinalEncoder to encode features, converting unknown categories to the value -1:

import numpy as np
from sklearn.preprocessing import OrdinalEncoder

# Create encoder
ordinal_encoder = OrdinalEncoder(handle_unknown='use_encoded_value',
                                 unknown_value=-1)

# Fit on training data
ordinal_encoder.fit(np.array([1,2,3,4,5]).reshape(-1, 1))

# Transform, notice that 0 and 6 are values that were never seen before
ordinal_encoder.transform(np.array([0,1,2,3,4,5,6]).reshape(-1, 1))

Output:

array([[-1.],
       [ 0.],
       [ 1.],
       [ 2.],
       [ 3.],
       [ 4.],
       [-1.]])
Suicide answered 2/1, 2021 at 10:40 Comment(1)
This is the actual answer to this question. – Odyl
52

LabelEncoder is basically a dictionary. You can extract it and use it for future encoding:

from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()
le.fit(X)

le_dict = dict(zip(le.classes_, le.transform(le.classes_)))

Retrieve the label for a single new item; if the item is missing, fall back to '<Unknown>':

le_dict.get(new_item, '<Unknown>')

Retrieve labels for a DataFrame column:

df[your_col] = df[your_col].apply(lambda x: le_dict.get(x, '<Unknown>'))  # or any sentinel you prefer
Tumescent answered 25/9, 2018 at 19:21 Comment(0)
46

I ended up switching to Pandas' get_dummies due to this problem of unseen data.

  • create the dummies on the training data
    dummy_train = pd.get_dummies(train)
  • create the dummies on the new (unseen) data
    dummy_new = pd.get_dummies(new_data)
  • re-index the new data to the columns of the training data, filling the missing values with 0 (note that the result of reindex must be assigned back)
    dummy_new = dummy_new.reindex(columns=dummy_train.columns, fill_value=0)

Effectively, any new categorical levels will not go into the classifier, but I think that should not cause problems, as the classifier would not know what to do with them anyway.
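
A minimal sketch of that behaviour (column and category names are just illustrative): a category missing from the new data comes back as an all-zero column, while a category never seen in training is dropped:

import pandas as pd

train = pd.DataFrame({'color': ['red', 'green']})
new_data = pd.DataFrame({'color': ['red', 'blue']})  # 'blue' never appeared in training

dummy_train = pd.get_dummies(train)   # columns: color_green, color_red
dummy_new = pd.get_dummies(new_data)  # columns: color_blue, color_red

# 'color_green' is recreated and filled with 0; 'color_blue' is dropped
dummy_new = dummy_new.reindex(columns=dummy_train.columns, fill_value=0)
print(dummy_new)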

Hitt answered 17/11, 2015 at 15:53 Comment(8)
Instead of dummies.columns, do you mean dummy_train.columns? – Ebby
@KevinMarkham kudos to you Sir, caught a bug that had been there for almost a year :) – Hitt
When saving (pickle) the model, do you save dummy_train.columns into its own file? – Susuable
@Susuable generally I'll use it in a pipeline object. I can't say I know enough about pickling, I generally avoid it, but would venture a guess that the state in the pipeline should hold and keep those columns. – Hitt
@Susuable in my case, I saved the columns in the same file as the model. Just make sure you write and read in the same order! – Battles
@Hitt I am trying to build a transformer following your answer but I get an error. How do you use it inside the sklearn pipeline? – Copenhaver
@Copenhaver I do indeed, what's your error? PS: if you think it can help others make it a separate question and paste the link here :) – Hitt
I had some issues with the index but I've found a workaround ;) Thank you! – Copenhaver
37

I have created a class to support this. If a new label comes in, it will be assigned to the Unknown class.

from sklearn.preprocessing import LabelEncoder
import numpy as np


class LabelEncoderExt(object):
    def __init__(self):
        """
        It differs from LabelEncoder by handling new classes and providing a value for it [Unknown]
        Unknown will be added in fit and transform will take care of new item. It gives unknown class id
        """
        self.label_encoder = LabelEncoder()
        # self.classes_ = self.label_encoder.classes_

    def fit(self, data_list):
        """
        This will fit the encoder for all the unique values and introduce unknown value
        :param data_list: A list of string
        :return: self
        """
        self.label_encoder = self.label_encoder.fit(list(data_list) + ['Unknown'])
        self.classes_ = self.label_encoder.classes_

        return self

    def transform(self, data_list):
        """
        This will transform the data_list to id list where the new values get assigned to Unknown class
        :param data_list:
        :return:
        """
        new_data_list = list(data_list)
        for unique_item in np.unique(data_list):
            if unique_item not in self.label_encoder.classes_:
                new_data_list = ['Unknown' if x==unique_item else x for x in new_data_list]

        return self.label_encoder.transform(new_data_list)

Sample usage:

country_list = ['Argentina', 'Australia', 'Canada', 'France', 'Italy', 'Spain', 'US', 'Canada', 'Argentina', 'US']

label_encoder = LabelEncoderExt()

label_encoder.fit(country_list)
print(label_encoder.classes_) # you can see new class called Unknown
print(label_encoder.transform(country_list))


new_country_list = ['Canada', 'France', 'Italy', 'Spain', 'US', 'India', 'Pakistan', 'South Africa']
print(label_encoder.transform(new_country_list))
Amr answered 3/7, 2019 at 18:50 Comment(4)
How do we get access to encoder.classes_ and inverse_transform with this modified class? – Kimble
Same question here. – Quote
@SandeepNalla and @ah25, to get the classes use label_encoder.classes_ or label_encoder.label_encoder.classes_ – Amr
Great answer... I was able to speed it up by using a try-except statement in the transform function instead of checking for unknowns every time. Additionally, I added an inverse_transform function that just returns self.label_encoder.inverse_transform(list(data_list)) – Moonrise
11

I recently ran into this problem and came up with a pretty quick solution. My answer solves a little more than just this problem, but it will easily work for your issue too. (I think it's pretty cool.)

I am working with pandas DataFrames and originally used sklearn's LabelEncoder() to encode my data, which I would then pickle to use in other modules of my program.

However, LabelEncoder in sklearn's preprocessing does not have the ability to add new values to the encoding. I solved the problem of encoding multiple columns and saving the mappings, as well as being able to add new values to the encoder, like this (here's a rough outline of what I did):

encoding_dict = dict()
for col in cols_to_encode:
    #get unique values in the column to encode
    values = df[col].value_counts().index.tolist()

    # create a dictionary of values and corresponding numbers {value: number}
    dict_values = {value: count for value, count in zip(values, range(1,len(values)+1))}

    # save the values to encode in the dictionary
    encoding_dict[col] = dict_values

    # replace the values with the corresponding number from the dictionary
    df[col] = df[col].map(lambda x: dict_values.get(x))

Then you can simply save the dictionary to a JSON file, pull it back in later, and add any value you want by adding the new value and its corresponding integer code.
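
A sketch of that round trip (the file and column names are just examples, and it assumes string-valued categories, since JSON keys must be strings):

import json

# save the mapping built above
with open('encoding_dict.json', 'w') as f:
    json.dump(encoding_dict, f)

# later: load it back and register a previously unseen value
with open('encoding_dict.json') as f:
    encoding_dict = json.load(f)

col_map = encoding_dict['my_col']                       # hypothetical column name
col_map['brand_new_value'] = max(col_map.values()) + 1  # next free integer code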

I'll explain some reasoning behind using map() instead of replace(): I found that pandas' replace() function took over a minute to iterate through around 117,000 rows, while map() brought that time to just over 100 ms.

TLDR: instead of using sklearn's preprocessing, just work with your DataFrame by making a mapping dictionary and mapping out the values yourself.

Interjacent answered 7/8, 2018 at 20:48 Comment(1)
Do you happen to know if this is faster than defaultdict + LabelEncoder? – Feer
8

I get the impression that what you've done is quite similar to what other people do when faced with this situation.

There's been some effort to add the ability to encode unseen labels to the LabelEncoder (see especially https://github.com/scikit-learn/scikit-learn/pull/3483 and https://github.com/scikit-learn/scikit-learn/pull/3599), but changing the existing behavior is actually more difficult than it seems at first glance.

For now it looks like handling "out-of-vocabulary" labels is left to individual users of scikit-learn.

Bole answered 17/6, 2015 at 6:47 Comment(0)
6

I know two devs who are working on building wrappers around transformers and sklearn pipelines. They have two robust encoder transformers (one dummy and one label encoder) that can handle unseen values. Here is the documentation for their skutil library. Search for skutil.preprocessing.OneHotCategoricalEncoder or skutil.preprocessing.SafeLabelEncoder. In their SafeLabelEncoder(), unseen values are automatically encoded to 999999.
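
Assuming SafeLabelEncoder follows the usual scikit-learn fit/transform convention (an assumption; check the linked documentation for the exact API), usage would look roughly like this:

from skutil.preprocessing import SafeLabelEncoder

le = SafeLabelEncoder()
le.fit(['a', 'b', 'c'])
le.transform(['a', 'd'])  # 'd' was never seen, so it should encode to 999999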

Hartzog answered 22/10, 2017 at 17:19 Comment(3)
Have they not tried to submit to sklearn itself? This is a universal issue. Obviously we parameterize the default_label_value. – Ares
Just curious, would there be a benefit at all to making the default -1 instead of 999999? Say for example my categorical has 56 categories; I think I would prefer my labels to be between -1 and 56 instead of 0 through 56 with a 999999 tacked on to the end. Plus if you do the categorical transformation before you scale, then you could squish the numbers on a 0 to 1 scale or properly scale/center them, yes? If you were to use 999999, that would seem to eliminate the option for further processing and potentially add an extremely different magnitude to your feature's scale. Am I overthinking? – Jonme
Typically in most of my workflows, unseen values are filtered out of the pipeline during inference/prediction time, so to me it doesn't matter if they are encoded as -1 or 999999. – Hartzog
3

Here is an approach using a relatively new feature of pandas, the category dtype. The main motivation is that machine learning packages like lightgbm can accept pandas category columns as features, which is better than one-hot encoding in some situations. In this example the transformer returns integers (it could also be changed to keep the category dtype), and it replaces unseen categorical values with -1.

from collections import defaultdict
from sklearn.base import BaseEstimator,TransformerMixin
from pandas.api.types import CategoricalDtype
import pandas as pd
import numpy as np

class PandasLabelEncoder(BaseEstimator,TransformerMixin):
    def __init__(self):
        self.label_dict = defaultdict(list)

    def fit(self, X):
        X = X.astype('category')
        cols = X.columns
        values = list(map(lambda col: X[col].cat.categories, cols))
        self.label_dict = dict(zip(cols,values))
        # return as category for xgboost or lightgbm 
        return self

    def transform(self,X):
        # check missing columns
        missing_col=set(X.columns)-set(self.label_dict.keys())
        if missing_col:
            raise ValueError('the column named {} is not in the label dictionary. Check your fitting data.'.format(missing_col)) 
        return X.apply(lambda x: x.astype('category')
                                  .cat.set_categories(self.label_dict[x.name])
                                  .cat.codes
                                  .astype('category')
                                  .cat.set_categories(np.arange(len(self.label_dict[x.name]))))


    def inverse_transform(self,X):
        return X.apply(lambda x: pd.Categorical.from_codes(codes=x.values,
                                                           categories=self.label_dict[x.name]))

dff1 = pd.DataFrame({'One': list('ABCC'), 'Two': list('bccd')})
dff2 = pd.DataFrame({'One': list('ABCDE'), 'Two': list('debca')})


enc = PandasLabelEncoder()
enc.fit_transform(dff1)

Output:

  One Two
0   0   0
1   1   1
2   2   1
3   2   2

dff3 = enc.transform(dff2)
dff3

Output:

  One Two
0   0   2
1   1  -1
2   2   0
3  -1   1
4  -1  -1

enc.inverse_transform(dff3)

Output:

   One  Two
0    A    d
1    B  NaN
2    C    b
3  NaN    c
4  NaN  NaN
Retina answered 5/12, 2019 at 21:59 Comment(0)
2

I was trying to deal with this problem and found two handy ways to encode categorical data from train and test sets, with and without using LabelEncoder. New categories are filled with some known category "c" (like "other" or "missing"). The first method seems to work faster. Hope that will help you.

import pandas as pd
import time
from sklearn.preprocessing import LabelEncoder

df = pd.DataFrame()

df["a"] = ['a', 'b', 'c', 'd']
df["b"] = ['a', 'b', 'e', 'd']


# LabelEncoder + map
t = time.perf_counter()  # time.clock() was removed in Python 3.8
le = LabelEncoder()
suf = "_le"
col = "a"
df[col + suf] = le.fit_transform(df[col])
dic = dict(zip(le.classes_, le.transform(le.classes_)))
col = 'b'
df[col + suf] = df[col].map(dic).fillna(dic["c"]).astype(int)
print(time.perf_counter() - t)

# ---
# pandas category

t = time.perf_counter()
df["d"] = df["a"].astype('category').cat.codes
dic = df["a"].astype('category').cat.categories.tolist()
# in older pandas this was df['b'].astype('category', categories=dic),
# but that signature was removed; use pd.CategoricalDtype instead
df['f'] = df['b'].astype(pd.CategoricalDtype(categories=dic)).fillna("c").cat.codes
df.dtypes
print(time.perf_counter() - t)
Cilurzo answered 9/1, 2018 at 13:27 Comment(1)
In the #pandas category approach, the line df['f']=df['b'].astype('category',categories=dic)........ is giving this error: TypeError: astype() got an unexpected keyword argument 'categories' – Guileful
2

I faced the same problem and realized that my encoder was somehow mixing values across the columns of my DataFrame. If you run a single encoder over several columns, and two different columns contain similar values, the labels assigned can get mixed up. What I did to solve the problem was to create a separate instance of LabelEncoder() for each column of my pandas DataFrame, and I got a nice result.

from sklearn.preprocessing import LabelEncoder

encoder1 = LabelEncoder()
encoder2 = LabelEncoder()
encoder3 = LabelEncoder()

df['col1'] = encoder1.fit_transform(list(df['col1'].values))
df['col2'] = encoder2.fit_transform(list(df['col2'].values))
df['col3'] = encoder3.fit_transform(list(df['col3'].values))

Regards!!

Dilorenzo answered 11/10, 2019 at 16:37 Comment(0)
1

LabelEncoder() should be used only for encoding target labels. To encode categorical features, use OneHotEncoder(), which can handle unseen values: https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OneHotEncoder.html#sklearn.preprocessing.OneHotEncoder
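
A minimal sketch with handle_unknown='ignore', under which unseen categories are encoded as all-zero rows:

import numpy as np
from sklearn.preprocessing import OneHotEncoder

enc = OneHotEncoder(handle_unknown='ignore')
enc.fit(np.array(['a', 'b', 'c']).reshape(-1, 1))

# 'd' was never seen during fit, so its row is all zeros
print(enc.transform(np.array(['a', 'd']).reshape(-1, 1)).toarray())
# [[1. 0. 0.]
#  [0. 0. 0.]]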

Countersink answered 9/12, 2019 at 11:30 Comment(2)
What if the features have cardinality greater than 10000+? – Iceman
Depends on the case. Multiple solutions are possible. Maybe you should think about bucketing or embedding. It's difficult to say without understanding the real case. – Countersink
1

If someone is still looking for it, here is my fix.

Say you have:

  • enc_list: a list of the variable names already encoded
  • enc_map: a dictionary mapping each variable in enc_list to its fitted encoder
  • df: a DataFrame containing values of a variable not present in enc_map

This will work assuming you already have a category "NA" or "Unknown" among the encoded values:

for l in enc_list:
    old_list = enc_map[l].classes_
    new_list = df[l].unique()
    na = [j for j in new_list if j not in old_list]
    df[l] = df[l].replace(na, 'NA')
Exsanguinate answered 21/5, 2020 at 17:3 Comment(0)
0

I wrote a custom label encoder which is updatable. Here it is:

import numpy as np


class LabelEncoderExt():
    def fit(self, dfs):
        # map each unique value to an integer code (dfs is a pandas Series)
        self.mapper = {element: code for code, element in enumerate(dfs.unique())}
        self.classes_ = len(self.mapper)  # here classes_ is the number of classes

    def transform(self, dfs):
        # note: values not present in the mapping are silently skipped,
        # so the output can be shorter than the input
        lst = [self.mapper[element] for element in dfs if element in self.mapper]
        return lst

    def fit_transform(self, dfs):
        self.fit(dfs)
        return self.transform(dfs)

    def inverse_transform(self, dfs):
        inv_mapper = {v: k for k, v in self.mapper.items()}
        return [inv_mapper[element] for element in dfs]

    def update_fit_transform(self, dfs):
        for elt in dfs:
            if elt not in self.mapper.keys():
                self.mapper[elt] = len(self.mapper.keys())
        return self.transform(dfs)

    def save(self, path):
        np.save(path, self.mapper)

    def load(self, path):
        self.mapper = np.load(path, allow_pickle=True).item()
        self.classes_ = len(self.mapper.keys())

    def get_classes(self):
        return self.classes_

    def get_mapper(self):
        return self.mapper

Usage looks like this:

import pandas as pd
country_list = ['Argentina', 'Australia', 'Canada', 'France', 'Italy', 'Spain', 'US', 'Canada', 'Argentina', 'US']
new_country_list = ['Canada', 'France', 'Italy', 'Spain', 'US', 'India', 'Pakistan', 'South Africa']
country_df = pd.DataFrame({'country': country_list})

le = LabelEncoderExt()
le.fit(country_df['country'])
le.get_mapper()

le.update_fit_transform(new_country_list)
le.get_mapper()
Edrick answered 21/9, 2023 at 10:16 Comment(0)
-4

If it is just about training and testing a model, why not just label-encode the entire dataset and then use the classes generated by the encoder object?

encoder = LabelEncoder()
encoder.fit(df["label"])  # fit on the full dataset so every class is known
train_y = encoder.transform(train_y)
test_y = encoder.transform(test_y)
Interleave answered 13/7, 2018 at 9:26 Comment(4)
I believe doing this would be an instance of data leakage (a cardinal ML sin). – Algia
This seems to be an excellent solution. As I see it, there is no issue of leakage when all we are doing is encoding a variable. – Fanfare
For new data, see my solution: #45495808 – Interleave
This solution works if we have fixed test data beforehand. However, this is not possible in real-life applications, where most of the time the test data is unknown to us. – Dissimilitude
