Python sklearn - Determine the encoding order of LabelEncoder
S

4

7

I want the labels assigned by sklearn's LabelEncoder (namely 0, 1, 2, 3, ...) to follow a specific order of the possible values of a categorical variable (say ['b', 'a', 'c', 'd']). LabelEncoder appears to assign the labels in lexicographic order, as can be seen in this example:

from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
le.fit(['b', 'a', 'c', 'd'])
le.classes_
# Output: array(['a', 'b', 'c', 'd'], dtype='<U1')
le.transform(['a', 'b'])
# Output: array([0, 1])

How can I force the encoder to keep the order in which the values are first seen in the .fit method (namely, to encode 'b' as 0, 'a' as 1, 'c' as 2, and 'd' as 3)?

Saxena answered 12/7, 2018 at 15:4 Comment(1)
I think you need the OrdinalEncoder described at github.com/scikit-learn-contrib/categorical-encoding and contrib.scikit-learn.org/categorical-encoding/ordinal.html - Raddle
P
9

You cannot do that with the original LabelEncoder.

LabelEncoder.fit() uses numpy.unique, which always returns the unique values in sorted order, as shown in the source:

def fit(self, y):
    y = column_or_1d(y, warn=True)
    self.classes_ = np.unique(y)
    return self
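
The sorting can be seen with NumPy alone, independent of scikit-learn. This is only a small illustration of numpy.unique on the question's data, not a copy of the library's internals:

```python
import numpy as np

labels = ['b', 'a', 'c', 'd']

# np.unique returns the distinct values sorted, not in input order.
# return_inverse gives, for each input label, its index in that sorted array.
classes, codes = np.unique(labels, return_inverse=True)

print(classes)  # sorted classes: a, b, c, d
print(codes)    # 'b' is encoded as 1 and 'a' as 0
```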

So if you want to do that, you need to override the fit() function. Something like this:

import pandas as pd
from sklearn.preprocessing import LabelEncoder
from sklearn.utils import column_or_1d

class MyLabelEncoder(LabelEncoder):

    def fit(self, y):
        y = column_or_1d(y, warn=True)
        self.classes_ = pd.Series(y).unique()
        return self

Then you can do this:

le = MyLabelEncoder()
le.fit(['b', 'a', 'c', 'd' ])
le.classes_
#Output:  array(['b', 'a', 'c', 'd'], dtype=object)

Here, I am using pandas.Series.unique() to get the unique classes in order of appearance. If you cannot use pandas for any reason, refer to this question, which does the same thing using numpy.
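
For reference, one common NumPy-only way to get unique values in order of first appearance is sketched below. The helper name unique_in_order is hypothetical, and this is not necessarily the approach taken in the linked question:

```python
import numpy as np

def unique_in_order(values):
    """Return the distinct values in order of first appearance."""
    arr = np.asarray(values)
    # np.unique sorts its result; return_index gives the position of the
    # first occurrence of each (sorted) unique value in the input.
    _, first_idx = np.unique(arr, return_index=True)
    # Sorting those positions restores first-appearance order.
    return arr[np.sort(first_idx)]

classes = unique_in_order(['b', 'a', 'c', 'd', 'a', 'b'])
print(classes)
```

Note that the result has a NumPy string dtype rather than object dtype, unlike the pandas version above.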

Pileate answered 12/7, 2018 at 16:43 Comment(1)
If someone stumbled upon this and is wondering why it is not working: I was accidentally using le.fit_transform(), which does not call the custom fit() method. So check that if it doesn't work for you. - Dialyser
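
Following up on that comment, a minimal sketch of one way to make fit_transform() respect the custom order is to override it as well, so that it delegates to fit() and transform(). The class name OrderedLabelEncoder is hypothetical, and dict.fromkeys is swapped in for pandas.Series.unique() to avoid the pandas dependency. This assumes a scikit-learn version whose transform() looks up object-dtype labels by value rather than by binary search over sorted classes:

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder
from sklearn.utils import column_or_1d

class OrderedLabelEncoder(LabelEncoder):
    """LabelEncoder variant that keeps classes in first-seen order (sketch)."""

    def fit(self, y):
        y = column_or_1d(y, warn=True)
        # dict.fromkeys drops duplicates while preserving insertion order.
        # Object dtype keeps the encoder on the lookup-by-value path.
        self.classes_ = np.array(list(dict.fromkeys(y)), dtype=object)
        return self

    def fit_transform(self, y):
        # The parent's fit_transform recomputes sorted classes itself,
        # so route through the overridden fit() explicitly.
        return self.fit(y).transform(y)

le = OrderedLabelEncoder()
codes = le.fit_transform(['b', 'a', 'c', 'd', 'a'])
print(le.classes_)
print(codes)
```

This sketch is meant for string labels. Numeric labels may still be encoded with a sorted search inside scikit-learn, so unsorted numeric classes are not covered here.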
D
2

Note that there is potentially a better way to do this now with http://contrib.scikit-learn.org/categorical-encoding/ordinal.html. In particular, see the mapping parameter:

a mapping of class to label to use for the encoding, optional. the dict contains the keys ‘col’ and ‘mapping’. the value of ‘col’ should be the feature name. the value of ‘mapping’ should be a dictionary of ‘original_label’ to ‘encoded_label’. example mapping: [{‘col’: ‘col1’, ‘mapping’: {None: 0, ‘a’: 1, ‘b’: 2}}]

Duval answered 7/2, 2020 at 21:29 Comment(1)
This is the old way and is no longer supported; the link is also broken. - Heuer
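
As an alternative that does not depend on the contrib package, scikit-learn ships its own OrdinalEncoder, which accepts an explicit category order through its categories parameter. The sketch below assumes scikit-learn 0.20 or later and string categories. Note that OrdinalEncoder is meant for feature columns, so it expects 2-D input and returns floats:

```python
import numpy as np
from sklearn.preprocessing import OrdinalEncoder

# Desired encoding order: 'b' -> 0, 'a' -> 1, 'c' -> 2, 'd' -> 3.
order = ['b', 'a', 'c', 'd']

# One list of categories per input column (a single column here).
enc = OrdinalEncoder(categories=[order])

# OrdinalEncoder expects a 2-D array of shape (n_samples, n_features).
X = np.array(['a', 'b', 'd', 'c'], dtype=object).reshape(-1, 1)

# The encoder returns floats; flatten and cast to int for readability.
codes = enc.fit_transform(X).ravel().astype(int)
print(codes)
```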
H
2

NOTE: This is not a standard way but a hacky approach. I overwrote the classes_ attribute after fitting to customize my mapping.

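import numpy as np
from sklearn import preprocessing

# df_1 is an existing DataFrame with a 'Temp' column of string labels
le_temp = preprocessing.LabelEncoder()
le_temp = le_temp.fit(df_1['Temp'])
print(df_1['Temp'])
# Overwrite the fitted (sorted) classes with the desired order
le_temp.classes_ = np.array(['Cool', 'Mild', 'Hot'])
print("New classes sequence::", le_temp.classes_)
df_1['Temp'] = le_temp.transform(df_1['Temp'])
print(df_1['Temp'])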

My output looks like this:

1      Hot
2      Hot
3      Hot
4     Mild
5     Cool
6     Cool

Name: Temp, dtype: object
New classes sequence:: ['Cool' 'Mild' 'Hot']

1     2
2     2
3     2
4     1
5     0
6     0

Name: Temp, dtype: int32
Heuer answered 13/10, 2020 at 13:26 Comment(0)
A
1

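Vivek Kumar's solution worked for me, but I had to do it this way:

class LabelEncoder(LabelEncoder):

    def fit(self, y):
        y = column_or_1d(y, warn=True)
        classes = pd.Series(y).unique()
        # ndarray.sort() sorts in place and returns None, so do not assign its result
        classes.sort()
        self.classes_ = classes
        return self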
Antiscorbutic answered 9/12, 2019 at 20:56 Comment(1)
The whole idea of the question was to not sort the classes. That's why I chose not to do that in my answer. - Pileate
