Why does one hot encoding improve machine learning performance? [closed]

I have noticed that when One Hot encoding is used on a particular data set (a matrix) and used as training data for learning algorithms, it gives significantly better results with respect to prediction accuracy, compared to using the original matrix itself as training data. How does this performance increase happen?

Achieve answered 4/7, 2013 at 12:4 Comment(1)
I’m voting to close this question because machine learning (ML) theory questions are off-topic on Stack Overflow; it's a gift-wrap candidate for Cross Validated.Fransiscafransisco

Many learning algorithms either learn a single weight per feature, or they use distances between samples. The former is the case for linear models such as logistic regression, which are easy to explain.

Suppose you have a dataset with only a single categorical feature, "nationality", with values "UK", "French" and "US". Assume, without loss of generality, that these are encoded as 0, 1 and 2. You then have a single weight w for this feature in a linear classifier, which makes its decision by comparing w×x against a threshold b: it predicts one class when w×x ≥ b and the other when w×x < b.

The problem now is that the weight w cannot encode a three-way choice. The three possible values of w×x are 0, w and 2×w; since these lie in order along a line, a single threshold can only separate them into contiguous groups. Either all three lead to the same decision (they're all < b or all ≥ b), or "UK" and "French" lead to the same decision, or "French" and "US" do. There's no way for the model to learn that "UK" and "US" should be given the same label, with "French" the odd one out.

By one-hot encoding, you effectively blow up the feature space to three features, which will each get their own weights, so the decision function is now w[UK]x[UK] + w[FR]x[FR] + w[US]x[US] < b, where all the x's are booleans. In this space, such a linear function can express any sum/disjunction of the possibilities (e.g. "UK or US", which might be a predictor for someone speaking English).
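
To make this concrete, here is a minimal sketch (not part of the original answer) using NumPy and scikit-learn; the 0/1/2 codes and the "English-speaking" target are toy data made up for illustration:

    import numpy as np
    from sklearn.linear_model import LogisticRegression

    # 0 = UK, 1 = French, 2 = US; target is 1 for English-speaking, 0 otherwise
    X_int = np.array([[0], [1], [2], [0], [1], [2]])
    y = np.array([1, 0, 1, 1, 0, 1])

    # A single weight on the integer code can only split the ordered values
    # {0, 1, 2} into contiguous groups, so "UK or US" is not learnable.
    print(LogisticRegression().fit(X_int, y).score(X_int, y))        # below 1.0

    # One-hot: one boolean column, and hence one weight, per nationality.
    X_onehot = np.eye(3)[X_int.ravel()]
    print(LogisticRegression().fit(X_onehot, y).score(X_onehot, y))  # 1.0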

Similarly, any learner based on standard distance metrics (such as k-nearest neighbors) between samples will get confused without one-hot encoding. With the naive encoding and Euclidean distance, the distance between French and US is 1. The distance between US and UK is 2. But with the one-hot encoding, the pairwise distances between [1, 0, 0], [0, 1, 0] and [0, 0, 1] are all equal to √2.
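
A quick sketch of the distance argument (again not from the original answer), assuming NumPy:

    import numpy as np

    uk, fr, us = 0, 1, 2                      # naive integer encoding
    print(abs(fr - us), abs(us - uk))         # 1 2: US looks closer to French than to UK

    one_hot = {"UK": np.array([1, 0, 0]),
               "FR": np.array([0, 1, 0]),
               "US": np.array([0, 0, 1])}
    for a, b in [("UK", "FR"), ("FR", "US"), ("UK", "US")]:
        print(a, b, np.linalg.norm(one_hot[a] - one_hot[b]))   # all sqrt(2) ≈ 1.414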

This is not true for all learning algorithms; decision trees and derived models such as random forests, if deep enough, can handle categorical variables without one-hot encoding.
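
A small sketch of the tree case (not from the original answer, reusing the toy data above): a decision tree can isolate the middle integer code with two threshold splits, which a single linear weight cannot do.

    import numpy as np
    from sklearn.tree import DecisionTreeClassifier

    X_int = np.array([[0], [1], [2], [0], [1], [2]])   # 0 = UK, 1 = French, 2 = US
    y = np.array([1, 0, 1, 1, 0, 1])                   # 1 = English-speaking

    tree = DecisionTreeClassifier().fit(X_int, y)
    # Splits such as x <= 0.5 and x <= 1.5 recover all three groups exactly.
    print(tree.score(X_int, y))                        # 1.0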

Baroscope answered 4/7, 2013 at 12:20 Comment(9)
Thanks for this Lars, but when we do one-hot encoding, which effectively increases the number of features, don't we need to increase the number of samples too, to make sure the model does not overfit?Perorate
@Perorate Compared to the obvious alternative representation of categorical variables, encoding each level as a distinct integer, I don't think it matters: you need sufficient statistics either way.Baroscope
Isn't this resulting in a linear model that is not identifiable? All columns for a categorical feature sum to one, for all features, so how can you interpret the weights?Aureaaureate
Is there any literature you could point to so I could read further into this? Thanks.Silverplate
Is there a benefit to using a less than full rank matrix (which you wouldn't do when building a regular statistical model) when employing machine learning techniques such as boosting?Vieira
I get how one-hot encoding allows us to pick any two of the three countries (for instance) using just one linear check, but how would we ever know which two? In the other method, if I found two countries less than a certain b, I would know those two were UK and French.Braunschweig
This looks to me just like what a statistician would call "dummy variables." But maybe there is some saving of storage space.Commemoration
One-hot vector encoding allows you to convert a bunch of categorical variables and represent them as binary choices; no matter what type your categories are, the representation will always be binary.Raimund
Hi, thanks for your answer. Can you explain your final sentence ("This is not true for all learning algorithms; decision trees and derived models such as random forests, if deep enough, can handle categorical variables without one-hot encoding") in more detail? I want to know why tree-based models do not need one-hot encoding.Stringhalt

Regarding the increase in the number of features caused by one-hot encoding, one can use feature hashing. With hashing, you can specify the number of buckets to be much smaller than the number of newly introduced features.
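
A minimal sketch of this idea using scikit-learn's FeatureHasher (the column name "nationality" and the bucket count are arbitrary choices for illustration):

    from sklearn.feature_extraction import FeatureHasher

    rows = [{"nationality": "UK"}, {"nationality": "French"}, {"nationality": "US"}]

    # 8 hash buckets, regardless of how many distinct categories ever show up.
    hasher = FeatureHasher(n_features=8, input_type="dict")
    X = hasher.transform(rows)
    print(X.shape)   # (3, 8)

Collisions become possible when many categories share few buckets; that is the price paid for the fixed, small feature space.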

Willable answered 16/7, 2015 at 21:8 Comment(0)

When you want to predict categories, you want to predict items of a set. Not using one-hot encoding is akin to giving the categories neighbouring categories (e.g. if you ran a regression on the integer codes of the categories instead), arranged in a particular way and a particular order.

Now, what happens if you assign category 0 to 0, category 1 to 1 and category 2 to 2 without one-hot encoding, and your algorithm isn't sure whether to predict 0 or 2: should it predict 1, even though it thinks the answer is either 0 or 2?

You see where this goes. The same applies to your data inputs: if they aren't supposed to be neighbours, don't present them to your algorithm as neighbours.
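
A small sketch of the "0 or 2" situation above (not part of the original answer), assuming scikit-learn: when the same input appears with classes 0 and 2, a regression on the integer labels averages them and outputs roughly 1, a class that never occurs for that input.

    import numpy as np
    from sklearn.linear_model import LinearRegression

    X = np.array([[0.0], [0.0], [1.0], [1.0]])
    y = np.array([0, 2, 0, 2])                 # each input is seen with classes 0 and 2

    model = LinearRegression().fit(X, y)
    print(model.predict([[0.0], [1.0]]))       # approximately [1. 1.]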

Manure answered 19/2, 2020 at 15:9 Comment(0)
