I'm preprocessing my data before fitting a machine learning model. Some of the features have high cardinality, like country and language.
Since one-hot encoding those features can produce sparse, high-dimensional data, I decided to look into the hashing trick and used Python's category_encoders like so:
from category_encoders.hashing import HashingEncoder
ce_hash = HashingEncoder(cols=['country'])
encoded = ce_hash.fit_transform(df.country)
encoded['country'] = df.country
encoded.head()
Looking at the result, I can see the collisions:
col_0 col_1 col_2 col_3 col_4 col_5 col_6 col_7 country
0 0 0 1 0 0 0 0 0 US <━┓
1 0 1 0 0 0 0 0 0 CA  ┃ US and SE collide
2 0 0 1 0 0 0 0 0 SE <━┛
3 0 0 0 0 0 0 1 0 JP
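As an aside, a collision among only four countries in eight columns is not surprising; a birthday-problem estimate (my own back-of-the-envelope calculation, assuming a uniform hash) puts the odds of at least one collision near 59%:

```python
def p_collision(k: int, n: int) -> float:
    """Probability that hashing k distinct values uniformly into
    n buckets produces at least one collision (birthday problem)."""
    p_free = 1.0
    for i in range(k):
        p_free *= (n - i) / n  # i-th value must avoid i occupied buckets
    return 1.0 - p_free

p_collision(4, 8)  # ≈ 0.59 for 4 countries in 8 hash columns
```

Doubling the columns to 16 already drops the probability to about 33%, which is why the number of hash columns matters so much at small vocabulary sizes.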
Further investigation led me to this Kaggle article. The hashing example there includes both X and y.
- What is the purpose of y? Does it help fight the collision problem?
- Should I add more columns to the encoder and encode more than one feature together (for example, country and language)?
I would appreciate an explanation of how to encode such categories using the hashing trick.
Update: Based on the comments I got from @CoMartel, I've looked at sklearn's FeatureHasher and written the following code to hash the country column:
import pandas as pd
from sklearn.feature_extraction import FeatureHasher
h = FeatureHasher(n_features=10, input_type='string')
f = h.transform(df.country)
df1 = pd.DataFrame(f.toarray())
df1['country'] = df.country
df1.head()
And got the following output:
0 1 2 3 4 5 6 7 8 9 country
0 -1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 -1.0 0.0 US
1 -1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 -1.0 0.0 US
2 -1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 -1.0 0.0 US
3 0.0 -1.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 CA
4 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 -1.0 0.0 SE
5 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 JP
6 -1.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 AU
7 -1.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 AU
8 -1.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 DK
9 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 -1.0 0.0 SE
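A note on this output, based on my reading of sklearn's documented behavior (worth double-checking): with input_type='string', FeatureHasher expects each sample to be an iterable of strings, and a bare Python string iterates over its characters, so the code above hashes individual letters rather than whole country codes. That would explain the all-zero row for JP (its two letters land in the same column with opposite signs and cancel) and the two nonzero entries for the other codes. Wrapping each value in a one-element list hashes the full string once:

```python
import pandas as pd
from sklearn.feature_extraction import FeatureHasher

df = pd.DataFrame({'country': ['US', 'US', 'US', 'CA', 'SE', 'JP']})

h = FeatureHasher(n_features=10, input_type='string')
# Each sample is a one-element list, so 'US' is hashed as a whole
# rather than as 'U' and 'S'; every row gets exactly one ±1 entry.
f = h.transform([[c] for c in df['country']])
df1 = pd.DataFrame(f.toarray())
df1['country'] = df['country']
```

The ±1 signs come from FeatureHasher's alternate_sign option (on by default), which flips the sign of some features so that colliding features tend to cancel rather than accumulate.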
- Is that the way to use the library in order to encode high-cardinality categorical values?
- Why are some values negative?
- How would you choose the "right" n_features value?
- How can I check the collision ratio?
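On the last two questions, neither library exposes a collision report as far as I know, but since the hashing trick is just a hash modulo the number of columns, a small stdlib sketch (md5 here is my illustrative choice, not necessarily what either library uses internally) can estimate the collision ratio for a given vocabulary and n_features:

```python
import hashlib
from collections import Counter

def bucket(value: str, n_buckets: int) -> int:
    # Map a category string to a column index: hash, then modulo.
    return int(hashlib.md5(value.encode('utf-8')).hexdigest(), 16) % n_buckets

def collision_ratio(values, n_buckets: int) -> float:
    # Fraction of distinct values that share a bucket with another value.
    distinct = set(values)
    counts = Counter(bucket(v, n_buckets) for v in distinct)
    collided = sum(1 for v in distinct if counts[bucket(v, n_buckets)] > 1)
    return collided / len(distinct)

countries = ['US', 'CA', 'SE', 'JP', 'AU', 'DK']
# Sweep candidate sizes; pick the smallest n_features whose collision
# ratio on the actual vocabulary is acceptable.
ratios = {n: collision_ratio(countries, n) for n in (8, 16, 32, 64)}
```

One way to choose n_features, then, is empirical: hash your real vocabulary at several sizes and take the smallest that keeps the ratio tolerable for your model.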
y only seems to exist to maintain compatibility with sklearn. Note that your example is two years old, and sklearn has since integrated its own FeatureHasher. y is not used there either. Simple example:

from sklearn.feature_extraction import FeatureHasher
h = FeatureHasher(n_features=15)
f = h.fit_transform(df[['country']].to_dict(orient='records'))
f.toarray()

– Paperboard

df = pd.DataFrame([_ for _ in 'abcdefghij'], columns=['country'])
Second column to group-encode:
df['language'] = [_ for _ in 'abcdefghij'[::-1]]

– Paperboard
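Putting the two comments together, here is a runnable sketch (toy country/language data taken from the comments) of hashing two columns into one shared space with the default input_type='dict':

```python
import pandas as pd
from sklearn.feature_extraction import FeatureHasher

# Toy frame from the comments: ten one-letter "countries" and
# "languages".
df = pd.DataFrame([_ for _ in 'abcdefghij'], columns=['country'])
df['language'] = [_ for _ in 'abcdefghij'[::-1]]

# With input_type='dict' (the default), each row is hashed as a
# {column: value} record, so country and language share one hash
# space of n_features columns.
h = FeatureHasher(n_features=15)
f = h.fit_transform(df[['country', 'language']].to_dict(orient='records'))
dense = f.toarray()
```

Whether to group features this way is a trade-off: a shared hash space keeps dimensionality fixed as you add features, at the cost of cross-feature collisions.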