Why shouldn't the sklearn LabelEncoder be used to encode input data?
The docs for sklearn.LabelEncoder start with

This transformer should be used to encode target values, i.e. y, and not the input X.

Why is this?

I post just one example of this recommendation being ignored in practice, although there seem to be many more. https://www.kaggle.com/matleonard/feature-generation contains

# (ks is the input data)

# Label encoding
from sklearn.preprocessing import LabelEncoder

cat_features = ['category', 'currency', 'country']
encoder = LabelEncoder()
encoded = ks[cat_features].apply(encoder.fit_transform)
Burse answered 25/1, 2020 at 23:13 Comment(0)
Maybe because:

  1. It doesn't naturally work on multiple columns at once.
  2. It doesn't support ordering. I.e. if your categories are ordinal, such as:

Awful, Bad, Average, Good, Excellent

LabelEncoder would give them an arbitrary order (alphabetical, since it sorts the unique values), which will not help your classifier.
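A quick sketch of that arbitrariness, using the rating strings from this example: LabelEncoder sorts the unique values alphabetically, so the codes it assigns have nothing to do with the intended Awful-to-Excellent scale.

```python
from sklearn.preprocessing import LabelEncoder

# Ordinal categories with a natural order: Awful < Bad < Average < Good < Excellent
ratings = ['Awful', 'Bad', 'Average', 'Good', 'Excellent']

enc = LabelEncoder()
codes = enc.fit_transform(ratings)

# Classes are stored in sorted (alphabetical) order, so the codes
# ignore the natural ranking of the scale.
print(list(enc.classes_))  # ['Average', 'Awful', 'Bad', 'Excellent', 'Good']
print(list(codes))         # [1, 2, 0, 4, 3]
```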

In this case you could use either an OrdinalEncoder or a manual replacement.

1. OrdinalEncoder:

Encode categorical features as an integer array.

import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

df = pd.DataFrame(data=[['Bad', 200], ['Awful', 100], ['Good', 350], ['Average', 300], ['Excellent', 1000]], columns=['Quality', 'Label'])
enc = OrdinalEncoder(categories=[['Awful', 'Bad', 'Average', 'Good', 'Excellent']])  # Use the 'categories' parameter to specify the desired order. Otherwise the order is inferred (sorted) from the data.
enc.fit_transform(df[['Quality']])  # Can either fit on 1 feature, or multiple features at once.

Output:

array([[1.],
       [0.],
       [3.],
       [2.],
       [4.]])

Notice the logical order in the output.

2. Manual replacement:

scale_mapper = {'Awful': 0, 'Bad': 1, 'Average': 2, 'Good': 3, 'Excellent': 4}
df['Quality'].replace(scale_mapper)

Output:

0    1
1    0
2    3
3    2
4    4
Name: Quality, dtype: int64
Ladykiller answered 6/3, 2021 at 9:50 Comment(0)
It is not that big of a deal when it changes the target values y, because the model simply learns to predict those encoded labels (for a regression, through its error).

The problem is when it changes up the input values X: imposing an arbitrary integer order on unordered categories feeds meaningless magnitudes to the model, which makes correct predictions impossible.

You can do it on X if there are not many options: for example, 2 categories, 2 currencies, or 2 cities encoded into ints does not change the game too much.
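A small sketch of that binary case (the currency values are made up for illustration): with only two levels, the 0/1 code is equivalent to a one-hot encoding with one column dropped, so no spurious ordering is introduced.

```python
from sklearn.preprocessing import LabelEncoder

# Hypothetical two-level feature; the values are made up for illustration.
currency = ['USD', 'EUR', 'EUR', 'USD', 'USD']

codes = LabelEncoder().fit_transform(currency)
# With only two classes, the 0/1 codes are the same as a one-hot
# encoding with one column dropped, so no fake ordering is added.
print(list(codes))  # [1, 0, 0, 1, 1]
```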

Teresiateresina answered 25/1, 2020 at 23:36 Comment(0)
I think they warn against using it for X (input data) because:

  • Categorical input data are better encoded as one-hot vectors than as integers in most cases, since the categories are usually not sortable.

  • Second, there is a technical problem: LabelEncoder is not programmed to handle tables (column-wise/feature-wise encoding would be necessary for X). LabelEncoder assumes that the data is just a flat list. That will be the problem.
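For the first point, a minimal sketch of the one-hot alternative (the country values are made up for illustration):

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Hypothetical nominal feature with no meaningful order.
df = pd.DataFrame({'country': ['US', 'GB', 'US', 'DE']})

enc = OneHotEncoder()  # returns a sparse matrix by default
onehot = enc.fit_transform(df[['country']]).toarray()

print(enc.categories_)  # [array(['DE', 'GB', 'US'], dtype=object)]
print(onehot)           # one column per category, 0/1 indicators
```

Each category becomes its own 0/1 column, so the model sees no artificial ordering between countries.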

from sklearn.preprocessing import LabelEncoder

enc = LabelEncoder()

categories = [x for x in 'abcdabaccba']
categories
## ['a', 'b', 'c', 'd', 'a', 'b', 'a', 'c', 'c', 'b', 'a']

categories_numerical = enc.fit_transform(categories)

categories_numerical
# array([0, 1, 2, 3, 0, 1, 0, 2, 2, 1, 0])

# so it makes out of categories numbers
# and can transform back

enc.inverse_transform(categories_numerical)
# array(['a', 'b', 'c', 'd', 'a', 'b', 'a', 'c', 'c', 'b', 'a'], dtype='<U1')
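And to see the flat-list assumption directly: passing a two-column table raises an error, so the encoder has to be applied per column (as the Kaggle snippet in the question does with .apply). The column values below are made up for illustration.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical two-column table; the values are made up for illustration.
df = pd.DataFrame({'currency': ['USD', 'EUR', 'USD'],
                   'country':  ['US', 'DE', 'US']})

enc = LabelEncoder()
try:
    enc.fit_transform(df)  # 2-D input: LabelEncoder expects a flat 1-D sequence
except ValueError as err:
    print('ValueError:', err)

encoded = df.apply(enc.fit_transform)  # per-column application works
print(encoded)
```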
Helmuth answered 25/1, 2020 at 23:47 Comment(3)
Putting aside the conventions for code examples on SO, I don't believe you've addressed the heart of the question, namely 'Why do the docs say that LabelEncoder should not be used on input data?' – Burse
@Burse LabelEncoder should not be used for input data because categorical input data are better encoded as one-hot vectors than as integers, I would say. Second, the problem is that LabelEncoder is not programmed to handle tables (and thus encode column-wise/feature-wise). LabelEncoder assumes that the data is just a flat list. That will be the problem. - Sorry for my tone - maybe you were hurt by it. Sorry. Corrected the answer. – Helmuth
Categorical data can be encoded in multiple ways, not just one-hot or with ordinal numbers. This is not one of the reasons why sklearn developers do not recommend using the LabelEncoder for predictor variables. – Roentgenoscope
