You can pass the complete df['capital city'] column
to LabelEncoder.fit()
before splitting the dataframe df into train and test.
For example, if df
is like this:
df['capital city'] = ['amsterdam', 'paris', 'tokyo', 'beijing', 'tokyo', 'newyork', 'paris']
Then, you can use:
from sklearn import preprocessing
le = preprocessing.LabelEncoder()
le.fit(df['capital city'])
le.classes_
Output: ['amsterdam', 'beijing', 'newyork', 'paris', 'tokyo']
Then use transform()
on train and test data to convert them to integers correctly.
train["capital city integers"] = le.transform(train["capital city"])
test["capital city integers"] = le.transform(test["capital city"])
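Putting the steps above together, here is a minimal end-to-end sketch. The split by row position (first five rows for train, last two for test) is only an assumption for illustration; use whatever split you already have.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

df = pd.DataFrame({
    "capital city": ["amsterdam", "paris", "tokyo", "beijing",
                     "tokyo", "newyork", "paris"]
})

# Fit on the full column so every city gets a code before the split.
le = LabelEncoder()
le.fit(df["capital city"])

# Hypothetical split for illustration: first five rows train, last two test.
train = df.iloc[:5].copy()
test = df.iloc[5:].copy()

# The same fitted encoder is used for both parts, so the codes agree.
train["capital city integers"] = le.transform(train["capital city"])
test["capital city integers"] = le.transform(test["capital city"])

train_codes = train["capital city integers"].tolist()
test_codes = test["capital city integers"].tolist()
```

Because the encoder was fitted on the whole column, 'newyork' gets a valid code in test even though it never appears in the train rows here.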
Hope this helps.
Note:
Although the suggestion above will work and is perfectly acceptable while you are learning, you should consider real-world scenarios before using it for real tasks. In the real world, all of your available data will be training data (so you fit the encoder on every capital city you have), and new data may then arrive containing a capital city value that has never been seen before. What would you like to do in that case?
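To make the problem concrete: transform() raises a ValueError when it meets a label that was not present during fit(). One possible way to cope, sketched below, is to look labels up in a plain dict and map anything unknown to a reserved code. The -1 code and the encode_with_fallback helper are my own assumptions for illustration, not part of scikit-learn.

```python
from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()
le.fit(["amsterdam", "paris", "tokyo", "beijing", "newyork"])

# transform() raises ValueError for a label that was not seen during fit().
try:
    le.transform(["london"])
    raised = False
except ValueError:
    raised = True

# Hypothetical fallback: reserve -1 for any city the encoder has never seen.
UNKNOWN_CODE = -1
code_by_city = {str(city): code for code, city in enumerate(le.classes_)}

def encode_with_fallback(cities):
    """Return the fitted code for known cities and UNKNOWN_CODE otherwise."""
    return [code_by_city.get(city, UNKNOWN_CODE) for city in cities]

new_codes = encode_with_fallback(["paris", "london"])
```

Whether a single "unknown" bucket is appropriate depends on your model; refitting on the enlarged data is another option, but it can change the codes of existing classes.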
new_class_name
would come at the end of all classes when Python does a string sort. Ex: if a, b, d are the existing classes, a new class of e
would work as expected. But if the new class is c, this doesn't work. – Pedrick