Understanding FeatureHasher, collisions and vector size trade-off

I'm preprocessing my data before implementing a machine learning model. Some of the features have high cardinality, such as country and language.

Since encoding those features as one-hot vectors can produce very wide, sparse data, I've decided to look into the hashing trick and used Python's category_encoders like so:

from category_encoders.hashing import HashingEncoder

ce_hash = HashingEncoder(cols=['country'])
encoded = ce_hash.fit_transform(df.country)
encoded['country'] = df.country
encoded.head()

Looking at the result, I can see collisions:

    col_0  col_1  col_2  col_3  col_4  col_5  col_6  col_7 country
0       0      0      1      0      0      0      0      0      US <━┓
1       0      1      0      0      0      0      0      0      CA   ┃  US and SE collide
2       0      0      1      0      0      0      0      0      SE <━┛
3       0      0      0      0      0      0      1      0      JP

Further investigation led me to this Kaggle article. The hashing example there includes both X and y.

  • What is the purpose of y, does it help to fight the collision problem?
  • Should I add more columns to the encoder and encode more than one feature together (for example country and language)?

I would appreciate an explanation of how to encode such categories using the hashing trick.

Update: Based on the comments I got from @CoMartel, I've looked at scikit-learn's FeatureHasher and written the following code to hash the country column:

import pandas as pd
from sklearn.feature_extraction import FeatureHasher

h = FeatureHasher(n_features=10, input_type='string')
f = h.transform(df.country)          # returns a sparse matrix
df1 = pd.DataFrame(f.toarray())
df1['country'] = df.country
df1.head()

And got the following output:

     0    1    2    3    4    5    6    7    8    9 country
0 -1.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0 -1.0  0.0      US
1 -1.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0 -1.0  0.0      US
2 -1.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0 -1.0  0.0      US
3  0.0 -1.0  1.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0      CA
4  0.0  0.0  0.0  0.0  0.0  0.0  0.0  1.0 -1.0  0.0      SE
5  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0      JP
6 -1.0  0.0  1.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0      AU
7 -1.0  0.0  1.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0      AU
8 -1.0  0.0  0.0  0.0  0.0  0.0  1.0  0.0  0.0  0.0      DK
9  0.0  0.0  0.0  0.0  0.0  0.0  0.0  1.0 -1.0  0.0      SE
  • Is that the way to use the library in order to encode high-cardinality categorical features?
  • Why are some values negative?
  • How would you choose the "right" n_features value?
  • How can I check the collisions ratio?
Pentlandite answered 2/12, 2020 at 12:46 Comment(2)
Looking at the source code, y only seems to exist to maintain compatibility with sklearn. Note that your example is two years old, and sklearn has since integrated its own FeatureHasher, where y is not used either. Simple example: from sklearn.feature_extraction import FeatureHasher; h = FeatureHasher(n_features=15); f = h.fit_transform(df[['country']].to_dict(orient='records')); f.toarray()Paperboard
Adding more columns didn't work in my example (collisions even with n_cols > cardinality), but encoding two columns at the same time seems to remove collisions in my case. I tried with: df = pd.DataFrame([_ for _ in 'abcdefghij'], columns=['country']), with a second column to group-encode: df['language'] = [_ for _ in 'abcdefghij'[::-1]]Paperboard

Is that the way to use the library in order to encode high-cardinality categorical features?

Yes. There is nothing wrong with your implementation.

You can think of the hashing trick as a "reduced-size one-hot encoding with a small risk of collision, which you don't need if you can tolerate the dimensionality of the original features".
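
To make this concrete, here is a minimal sketch comparing the two widths. The country values are illustrative, and wrapping each value in a one-element list is one way to make FeatureHasher treat the whole string as a single token:

import pandas as pd
from sklearn.feature_extraction import FeatureHasher

countries = pd.Series(['US', 'CA', 'SE', 'JP', 'AU', 'DK'])

# One-hot encoding: one column per distinct category, so the width grows with cardinality
one_hot = pd.get_dummies(countries)
print(one_hot.shape)   # (6, 6)

# Hashing trick: the width is fixed by n_features, whatever the cardinality
h = FeatureHasher(n_features=4, input_type='string')
hashed = h.transform([[c] for c in countries])   # each sample is a one-element list
print(hashed.shape)    # (6, 4)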

This idea was first introduced by Kilian Weinberger and co-authors. Their paper contains a full theoretical and empirical analysis of the algorithm.


Why are some values negative?

To reduce the impact of collisions, a signed hash function is used. Each string is first hashed with an ordinary hash function (e.g. the string is converted to a numerical value, say by summing the ASCII value of each character, and that value modulo n_features gives an index in [0, n_features)). A second, single-bit hash function is then applied to the same string; it outputs +1 or -1 by definition, and that sign is the value added at the index produced by the first hash.

Pseudo-code (it looks like Python, but usual_hash_function and single_bit_hash_function are placeholders):

import numpy as np

def hash_trick(features, n_features):
    res = np.zeros(n_features)                 # output vector, created once, not per feature
    for f in features:
        h = usual_hash_function(f)             # the ordinary hash of the raw value
        index = h % n_features                 # modulo gives the slot for f in res
        if single_bit_hash_function(f) == 1:   # the second, sign hash
            res[index] += 1
        else:
            res[index] -= 1                    # <--- this is why some values become negative

    return res
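
For readers who want something they can actually run, here is a self-contained sketch of the same mechanics. It uses MD5 from Python's hashlib as a stand-in hash (scikit-learn's FeatureHasher uses a signed 32-bit MurmurHash3 internally, but the index/sign logic is the same idea), and the hash_one helper is purely illustrative:

import hashlib
import numpy as np

def hash_one(value, n_features):
    # Illustrative helper: derive an index and a sign from a single MD5 digest.
    digest = hashlib.md5(value.encode('utf-8')).digest()        # 16 deterministic bytes
    index = int.from_bytes(digest[:8], 'little') % n_features   # which column the value lands in
    sign = 1 if digest[-1] % 2 == 0 else -1                     # the single-bit "sign hash"
    return index, sign

def hash_vector(categories, n_features):
    res = np.zeros(n_features)
    for value in categories:
        index, sign = hash_one(value, n_features)
        res[index] += sign   # a negative entry just means the sign hash returned -1
    return res

print(hash_vector(['US'], n_features=10))
print(hash_vector(['SE'], n_features=10))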

How would you choose the "right" n_features value?

As a rule of thumb: if you hash more categories than there are slots (i.e. more than n_features of them), a collision is certain to happen. The best-case scenario is therefore that every category maps to its own unique hash slot, which means n_features should be at least equal to the actual number of categories (in your particular case, the number of distinct countries). Keep in mind, though, that this is only the best case; it is not guaranteed mathematically. Hence, the higher the better, but how much higher? See the next question.
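
A quick way to get that lower bound on your own data (a sketch assuming the df from the question):

# The number of distinct categories is the minimum sensible n_features;
# anything smaller guarantees at least one collision.
n_categories = df['country'].nunique()
print(n_categories)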


How can I check the collisions ratio?

If we ignore the second, single-bit hash function, the problem reduces to what is known as the "birthday problem" for hashing.

This is a big topic. For a comprehensive introduction to this problem, I recommend you read this, and for some detailed math, I recommend this answer.

In a nutshell, what you need to know is this: if you hash m items into n slots with m roughly equal to sqrt(n), the probability of no collision is about exp(-1/2) ≈ 60.65%, which means there is approximately a 39.35% chance of at least one collision.

So, as a rule of thumb: if you have X countries, there is about a 40% chance of at least one collision when the hash output range (i.e. the n_features parameter) is X^2. In other words, there is roughly a 40% chance of collision when the number of countries in your example equals sqrt(n_features). Beyond that point, doubling n_features roughly halves the collision probability. (Personally, if this is not for security purposes but just a plain conversion from strings to numbers, it is not worth going too high.)
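
To measure the collision ratio empirically, you can hash each distinct category on its own and count how many end up sharing an identical hashed row. A minimal sketch, assuming the df from the question; the collision_ratio helper is hypothetical:

import pandas as pd
from sklearn.feature_extraction import FeatureHasher

def collision_ratio(categories, n_features):
    # Fraction of distinct categories whose hashed row is shared with another category.
    h = FeatureHasher(n_features=n_features, input_type='string')
    rows = h.transform([[c] for c in categories]).toarray()   # one category per sample
    counts = pd.Series([tuple(row) for row in rows]).value_counts()
    collided = counts[counts > 1].sum()   # categories involved in at least one clash
    return collided / len(categories)

countries = df['country'].unique()
for n in (8, 16, 32, 64, 128):
    print(n, collision_ratio(countries, n))

As n_features grows, the measured ratio should fall roughly in line with the birthday-problem estimate above.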

Side note for curious readers: for a large enough hash output size (e.g. 256 bits), the chance that an attacker can find (or exploit) a collision is negligible from a security perspective.


Regarding the y parameter, as already noted in a comment, it exists only for API compatibility and is not used (scikit-learn does this in many other transformers as well).

Clientele answered 8/12, 2020 at 20:25 Comment(0)
