kmodes vs. one-hot encoding + kmeans for categorical data?

I'm exploring the possibility of clustering some categorical data with Python. I currently have 8 features, each with approximately 3-10 levels.

As I understand it, both one-hot encoding followed by kmeans and kmodes can be used in this setting, with kmeans possibly becoming less suitable when there are many feature/level combinations, due to the curse of dimensionality.

Is this correct?

At the moment I am leaning towards the kmeans route, because it would give me the flexibility to throw in some numerical features as well, and computing the silhouette statistic to assess the optimal number of clusters seems much easier.
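For reference, here is a minimal sketch of the kmeans route I have in mind, using pandas.get_dummies for the one-hot encoding and scikit-learn's KMeans and silhouette_score (the data and column names are made up for illustration):

```python
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Toy categorical data; column names and levels are made up for illustration.
df = pd.DataFrame({
    "color": ["red", "blue", "blue", "green", "red", "green"],
    "size":  ["S", "M", "L", "S", "L", "M"],
})

# One-hot encode the categorical features.
X = pd.get_dummies(df)

# Try a few cluster counts and compare silhouette scores.
for k in range(2, 5):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    print(k, silhouette_score(X, labels))
```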

Does this make sense? Do you have any suggestions about situations in which one approach should be preferred over the other?

Thanks

Centrifugal answered 16/5, 2019 at 15:19

There are also variants that use the k-modes approach on the categorical attributes and the mean on continuous attributes.

K-modes has a big advantage over one-hot + k-means: it is interpretable. Every cluster has one explicit categorical value for the prototype. With k-means, because of the sum-of-squares (SSQ) objective, the one-hot variables attain the smallest errors when the centroid sits in between values (e.g., 0.3 on a 0/1 indicator). That is not desirable.
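A minimal sketch with the kmodes Python package illustrates this (the data here is made up; note how the prototypes remain actual category values rather than fractions between one-hot columns):

```python
import numpy as np
from kmodes.kmodes import KModes

# Toy categorical data; values are made up for illustration.
X = np.array([
    ["red", "S"], ["red", "M"],
    ["blue", "L"], ["blue", "L"],
    ["green", "S"], ["green", "M"],
])

km = KModes(n_clusters=3, init="Huang", n_init=5, random_state=0)
labels = km.fit_predict(X)

# Each prototype is a real category combination, e.g. ["blue", "L"],
# not a fractional point between one-hot columns.
print(km.cluster_centroids_)
```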

Venom answered 18/5, 2019 at 19:51

Refer to this paper by Huang (the author of k-modes): http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.134.83&rep=rep1&type=pdf

  1. He mentions that using k-means with one-hot encoding greatly increases the size of the dataset when the categorical attributes have many categories, which makes k-means computationally costly. So yes, your intuition about the curse of dimensionality is right.

  2. Also, the cluster means will make no sense, since values between 0 and 1 are not real values of the data. k-modes, on the other hand, produces cluster modes, which are real data points and hence make the clusters interpretable.

For your requirement of mixing numerical and categorical attributes, look at the k-prototypes method, which combines k-means and k-modes with a balancing weight factor (again, explained in the paper).

Code sample in Python:
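Here is a minimal sketch using the KPrototypes class from the kmodes package (the data and the numerical/categorical column split are made up for illustration):

```python
import numpy as np
from kmodes.kprototypes import KPrototypes

# Toy mixed data: one numerical column, two categorical columns.
# Values and the column split are made up for illustration.
X = np.array([
    [1.0, "red", "S"],
    [2.0, "red", "M"],
    [8.0, "blue", "L"],
    [9.0, "blue", "L"],
    [1.5, "green", "S"],
    [8.5, "green", "L"],
], dtype=object)

# gamma (left at its default here) is the balancing weight between the
# numerical (k-means) part and the categorical (k-modes) part; by default
# it is estimated from the data.
kp = KPrototypes(n_clusters=2, init="Cao", n_init=5, random_state=0)
labels = kp.fit_predict(X, categorical=[1, 2])

print(labels)
# Prototype values: numerical means plus categorical modes.
print(kp.cluster_centroids_)
```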

Cnut answered 12/9, 2019 at 15:32