How do I use scikit LabelEncoder for new labels?

Asked 3/8, 2017 at 22:8 Answered 12/10, 2018 at 11:38

pandas machine-learning scikit-learn sklearn-pandas scikits

So my code looks like this:

>>> from sklearn import preprocessing
>>> le = preprocessing.LabelEncoder()
>>> le.fit(train["capital city"])
LabelEncoder()
>>> list(le.classes_)
['amsterdam', 'paris', 'tokyo']
>>> le.transform(["tokyo", "tokyo", "paris"])
array([2, 2, 1])
>>> list(le.inverse_transform([2, 2, 1]))
['tokyo', 'tokyo', 'paris']

But what if my test dataset has something like "beijing", and "beijing" does not exist in the training set? Is there a way for the encoder to handle this without adding every possible capital city in the world?

Lxx answered 3/8, 2017 at 22:8 Comment(0)

For a real world scenario, where all you have is training data and new classes can come up later, you can try my solution:

import numpy as np

le.classes_ = np.append(le.classes_, "new_class_name")
le.transform(new_y)

Midwinter answered 12/10, 2018 at 11:38 Comment(1)

This works only when new_class_name sorts after all the existing classes under Python's string sort. For example, if a, b, d are the existing classes, a new class e works as expected, but a new class c does not. – Pedrick 8/1, 2020 at 11:41
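The limitation in the comment above seems to come from transform looking labels up on the assumption that classes_ is sorted. As a minimal sketch of one way around it (this is not scikit-learn's API; AppendOnlyEncoder and its extend flag are hypothetical names made up for illustration), a plain dictionary can hand each new label the next free integer, so nothing depends on sort order and the codes of already-seen labels never change:

```python
class AppendOnlyEncoder:
    """Hypothetical helper (not part of scikit-learn): maps labels to integer codes."""

    def __init__(self):
        self.code_of = {}   # label -> integer code
        self.classes = []   # integer code -> label

    def _add(self, label):
        # Give an unseen label the next free code; known labels keep theirs.
        if label not in self.code_of:
            self.code_of[label] = len(self.classes)
            self.classes.append(label)

    def fit(self, labels):
        # Sorted order on the first fit, similar to what LabelEncoder produces.
        for label in sorted(set(labels)):
            self._add(label)
        return self

    def transform(self, labels, extend=False):
        # With extend=True, unseen labels are registered instead of raising KeyError.
        if extend:
            for label in labels:
                self._add(label)
        return [self.code_of[label] for label in labels]

    def inverse_transform(self, codes):
        return [self.classes[code] for code in codes]


enc = AppendOnlyEncoder().fit(["a", "b", "d"])
codes = enc.transform(["c", "d", "a"], extend=True)
print(codes)                         # [3, 2, 0]
print(enc.inverse_transform(codes))  # ['c', 'd', 'a']
```

Keep in mind that a model trained on the original codes has still never seen code 3, so this only fixes the encoding step, not what the model does with the new value.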

You can fit the LabelEncoder on the full df['capital city'] column before splitting the dataframe df into train and test.

For example, if df is like this:

df['capital city'] = ['amsterdam', 'paris', 'tokyo', 'beijing', 'tokyo', 'newyork', 'paris']

Then, you can use:

from sklearn import preprocessing

le = preprocessing.LabelEncoder()
le.fit(df['capital city'])

le.classes_
Output: ['amsterdam', 'beijing', 'newyork', 'paris', 'tokyo']

Then use transform() on train and test data to convert them to integers correctly.

train["capital city integers"] = le.transform(train["capital city"])
test["capital city integers"] = le.transform(test["capital city"])

Hope this helps.

Note: Although the suggestion above will work and is perfectly acceptable while you are learning, you should think about real-world scenarios before using it for real tasks. In the real world, all of your available data will be training data (so you fit the encoder on the capital cities in it), and new data may then arrive containing a capital city value never seen before. What would you like to do in that case?
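One possible answer to the question in the note, sketched under assumptions: reserve a placeholder class when fitting, and map every label the encoder has not seen to that placeholder before calling transform. The "<unknown>" token and the sample lists below are made up for illustration; LabelEncoder itself has no built-in option for this.

```python
from sklearn.preprocessing import LabelEncoder

UNKNOWN = "<unknown>"  # hypothetical placeholder token, not a scikit-learn feature

train_cities = ["amsterdam", "paris", "tokyo", "tokyo"]
test_cities = ["tokyo", "beijing", "paris"]

# Fit on the training labels plus the placeholder so it gets its own code.
le = LabelEncoder()
le.fit(train_cities + [UNKNOWN])

# Replace any label the encoder has not seen with the placeholder.
known = set(le.classes_)
test_clean = [city if city in known else UNKNOWN for city in test_cities]

test_codes = le.transform(test_clean)
print(le.classes_.tolist())  # ['<unknown>', 'amsterdam', 'paris', 'tokyo']
print(test_codes.tolist())   # [3, 0, 2]
```

All unseen cities collapse into a single code here, which keeps transform from raising but also means the model cannot tell "beijing" apart from any other new city.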

Crispin answered 4/8, 2017 at 1:33 Comment(2)

Uh that's the problem. I am using this in the real world, and I believe your given suggestion isn't scalable. I am hoping this PR gets merged. Otherwise, I will try to implement my own method in how to handle new categorical information. – Lxx 4/8, 2017 at 16:54

@Lxx Yes, that's what I asked. In the real world, if you train on one set of data and unseen values come in at prediction time, the model may not give good results on them, or may fail altogether. – Crispin 5/8, 2017 at 14:37

You can try the solution from "sklearn.LabelEncoder with never seen before values": https://mcmap.net/q/203680/-sklearn-labelencoder-with-never-seen-before-values

Popelka answered 9/1, 2018 at 14:13 Comment(0)
