SciKit-Learn Label Encoder resulting in error 'argument must be a string or number'

I'm a bit confused - creating an ML model here.

I'm at the step where I'm trying to take categorical features from a "large" dataframe (180 columns) and one-hot them so that I can find the correlation between the features and select the "best" features.

Here is my code:

# import labelencoder
from sklearn.preprocessing import LabelEncoder

# instantiate labelencoder object
le = LabelEncoder()

# apply le on categorical feature columns
df = df.apply(lambda col: le.fit_transform(col))
df.head(10)

When running this I get the following error:

TypeError: ('argument must be a string or number', 'occurred at index LockTenor')

So I head over to the LockTenor field and look at all the distinct values:

df.LockTenor.unique()

This results in the following:

array([60.0, 45.0, 'z', 90.0, 75.0, 30.0], dtype=object)

That looks like all strings and numbers to me. Is the error caused because the values are floats rather than ints?

Fatma answered 14/11, 2019 at 23:47 Comment(1)
Hi there. What happens if you change df.apply(lambda col: le.fit_transform(col)) to df.apply(lambda col: LabelEncoder().fit_transform(col))? I wonder if your encoder is getting confused by the subsequent fit_transform calls because it's not being re-initialised. - Protolanguage

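You get this error because the column does indeed contain a mix of floats and strings. To build its list of classes, LabelEncoder has to sort the unique values, and Python 3 cannot order a float against a string. Take a look at this example: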

# Preliminaries
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Create DataFrames

# df1 has all floats
d1 = {'LockTenor':[60.0, 45.0, 15.0, 90.0, 75.0, 30.0]}
df1 = pd.DataFrame(data=d1)
print("DataFrame 1")
print(df1)

# df2 has a string in the mix
d2 = {'LockTenor':[60.0, 45.0, 'z', 90.0, 75.0, 30.0]}
df2 = pd.DataFrame(data=d2)
print("DataFrame 2")
print(df2)

# Create encoder
le = LabelEncoder()

# Encode DataFrame 1 (where all values are floats)
df1 = df1.apply(lambda col: le.fit_transform(col), axis=0, result_type='expand')
print("DataFrame 1 encoded")
print(df1)

# Encode DataFrame 2 (a mix of floats and strings); this call raises the TypeError
df2 = df2.apply(lambda col: le.fit_transform(col), axis=0, result_type='expand')
print("DataFrame 2 encoded")
print(df2)

If you run this code, you will see that df1 is encoded with no problem, since all its values are floats. However, you will get the error that you are reporting for df2.
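As far as I understand it, the failure comes from ordering the unique values rather than from anything specific to floats versus ints. A minimal sketch that reproduces the same kind of failure in plain Python, using the values from the question (the variable names here are only for illustration):

```python
# The distinct values reported for LockTenor in the question
values = [60.0, 45.0, 'z', 90.0, 75.0, 30.0]

# Sorting a collection that mixes float and str fails in Python 3,
# because '<' is not defined between those two types.
try:
    sorted(set(values))
    sort_failed = False
except TypeError as exc:
    sort_failed = True
    print("sorting failed:", exc)

# Once every value is text, there is a single comparable type again.
as_text = sorted({str(v) for v in values})
print(as_text)
```

Note that after the cast the values are ordered as text, not as numbers, so the integer codes no longer follow numeric order in general.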

An easy fix is to cast the column to strings. You can do this inside the lambda function:

df2 = df2.apply(lambda col: le.fit_transform(col.astype(str)), axis=0, result_type='expand')
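If you later need to map the codes back to the original labels, one option is to keep a separate fitted encoder per column instead of reusing a single le. This is only a sketch: the Channel column and the encoders dictionary are hypothetical names added for illustration, not part of the original data.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

df = pd.DataFrame({
    'LockTenor': [60.0, 45.0, 'z', 90.0, 75.0, 30.0],
    'Channel': ['a', 'b', 'a', 'c', 'b', 'a'],  # hypothetical second column
})

encoders = {}                       # column name -> fitted LabelEncoder
encoded = pd.DataFrame(index=df.index)

for name in df.columns:
    enc = LabelEncoder()
    # Cast to str so each column holds a single comparable type
    encoded[name] = enc.fit_transform(df[name].astype(str))
    encoders[name] = enc

print(encoded)

# Recover the (string) labels for one column from its own encoder
restored = encoders['LockTenor'].inverse_transform(encoded['LockTenor'].to_numpy())
print(list(restored))
```

With a single shared encoder, only the classes of the last column fitted would still be available, so inverse_transform would not work for the earlier columns.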

As an additional suggestion, I would recommend that you take a look at your data and check whether they are correct. To me, it is a bit odd to have a mix of floats and strings in the same column.
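With 180 columns, checking by eye is tedious. One rough way to list the columns that hold more than one Python type is sketched below; the Rate column is a hypothetical stand-in for a clean numeric column.

```python
import pandas as pd

df = pd.DataFrame({
    'LockTenor': [60.0, 45.0, 'z', 90.0, 75.0, 30.0],
    'Rate': [3.5, 3.7, 3.6, 3.9, 4.0, 3.4],  # hypothetical all-float column
})

# Only object-dtype columns can hold a mix of Python types
mixed = {}
for name in df.select_dtypes(include='object').columns:
    type_names = sorted({type(v).__name__ for v in df[name]})
    if len(type_names) > 1:
        mixed[name] = type_names

print(mixed)
```

Any column that shows up in mixed would trigger the same TypeError and is worth inspecting before encoding.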

Finally, I would just like to point out that scikit-learn's LabelEncoder performs a simple integer encoding of the values; it does not perform one-hot encoding. If you want one-hot encoding, I recommend you take a look at OneHotEncoder.
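For comparison, here is a small sketch of what OneHotEncoder could look like on the same column. It assumes the values are cast to strings first, as in the fix above; handle_unknown='ignore' is my own choice here, not something the question requires.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

df = pd.DataFrame({'LockTenor': [60.0, 45.0, 'z', 90.0, 75.0, 30.0]})

# OneHotEncoder expects 2-D input, hence the double brackets
X = df[['LockTenor']].astype(str)

ohe = OneHotEncoder(handle_unknown='ignore')
onehot = ohe.fit_transform(X).toarray()  # default output is sparse; densify to inspect

print(ohe.categories_[0])  # one output column per category
print(onehot.shape)
```

Each row then has exactly one 1 among six columns, instead of a single integer code.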

Ilarrold answered 15/11, 2019 at 0:51 Comment(3)
Thank you, this worked! The z is my NaN replacement :) - Fatma
Interesting choice. I would suggest that you use numpy's np.nan or pandas' pd.NA (from 1.0 on). This way, you can use more of the functions that handle missing values easily (such as fillna). - Ilarrold
Hi, I have a similar problem. If you have time, can I request your help with this related post? #71194240 - Astrophotography
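To illustrate the np.nan suggestion from the comments: a rough sketch that swaps the 'z' placeholder for a real missing value so the column becomes numeric again. Filling with the median is just one possible choice for the example.

```python
import pandas as pd

df = pd.DataFrame({'LockTenor': [60.0, 45.0, 'z', 90.0, 75.0, 30.0]})

# Blank out the 'z' placeholder (mask inserts NaN), then convert to a numeric dtype
tenor = pd.to_numeric(df['LockTenor'].mask(df['LockTenor'] == 'z'))

n_missing = int(tenor.isna().sum())     # count the missing values
filled = tenor.fillna(tenor.median())   # median ignores NaN by default

print(tenor.dtype, n_missing)
print(filled.tolist())
```

Whether LockTenor should then be treated as a number or as a category depends on the model, which the question does not specify.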

Try this, where cat is the name of a single categorical column and le is a LabelEncoder instance:

df[cat] = le.fit_transform(df[cat].astype(str))
Frigidarium answered 27/3, 2021 at 10:55 Comment(0)
