I'm exploring the possibility of clustering some categorical data with Python. I currently have 8 features, each with approximately 3-10 levels.
As I understand it, both k-means on one-hot encoded data and k-modes can be used in this setting, with k-means possibly becoming less suitable when the number of feature/level combinations is large, because of curse-of-dimensionality problems.
Is this correct?
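To make the comparison concrete, here is a minimal stdlib-only sketch of the two views of the same data, as far as I understand them. The `one_hot`, `sq_euclidean`, and `mismatches` helpers and the toy rows are my own illustration, not any library's API. It checks that the squared Euclidean distance between two one-hot encoded rows is twice the number of features on which they differ, i.e. twice the simple matching dissimilarity that k-modes is usually described as using.

```python
from itertools import chain

# Toy categorical rows (hypothetical data): 3 features, a few levels each.
rows = [
    ("red", "small", "round"),
    ("red", "large", "round"),
    ("blue", "large", "square"),
]


def one_hot(rows):
    """One-hot encode tuples of categories, one 0/1 column per (feature, level)."""
    n_features = len(rows[0])
    # Sorted levels per feature give a deterministic column order.
    levels = [sorted({row[j] for row in rows}) for j in range(n_features)]
    encoded = []
    for row in rows:
        vec = list(chain.from_iterable(
            [1 if row[j] == level else 0 for level in levels[j]]
            for j in range(n_features)
        ))
        encoded.append(vec)
    return encoded


def sq_euclidean(u, v):
    """Squared Euclidean distance between two numeric vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


def mismatches(r, s):
    """Simple matching dissimilarity: number of features that differ."""
    return sum(a != b for a, b in zip(r, s))


X = one_hot(rows)
for i in range(len(rows)):
    for j in range(i + 1, len(rows)):
        # Print both distances side by side for each pair of rows.
        print(i, j, sq_euclidean(X[i], X[j]), mismatches(rows[i], rows[j]))
```

So the pairwise distances carry the same information in both encodings. What differs is the cluster centre: k-means averages the 0/1 columns into fractional centroids, while k-modes keeps the most frequent level of each feature.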
At the moment I would go the k-means route, because it would give me the flexibility to add some numerical features as well, and because computing the silhouette statistic and assessing the optimal number of clusters seems to be much easier.
Does this make sense? Do you have any suggestions about situations in which one approach should be preferred over the other?
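For reference, this is how I understand the silhouette statistic: a minimal pure-Python sketch that only needs a pairwise dissimilarity and a set of cluster labels, so in principle it could be fed labels from either approach. The `silhouette` and `mismatches` functions, the toy rows, and the hand-assigned labels are all hypothetical, and scoring single-member clusters as 0 is a convention I'm assuming here.

```python
def mismatches(r, s):
    """Simple matching dissimilarity: number of features that differ."""
    return sum(a != b for a, b in zip(r, s))


def silhouette(rows, labels, dist):
    """Mean silhouette width for the given labels and pairwise dissimilarity."""
    clusters = {}
    for idx, lab in enumerate(labels):
        clusters.setdefault(lab, []).append(idx)
    if len(clusters) < 2:
        raise ValueError("silhouette needs at least 2 clusters")

    scores = []
    for i, lab in enumerate(labels):
        own = [j for j in clusters[lab] if j != i]
        if not own:
            # Single-member cluster: scored 0 by the assumed convention.
            scores.append(0.0)
            continue
        # a: mean dissimilarity to the other members of the same cluster.
        a = sum(dist(rows[i], rows[j]) for j in own) / len(own)
        # b: smallest mean dissimilarity to the members of any other cluster.
        b = min(
            sum(dist(rows[i], rows[j]) for j in members) / len(members)
            for other, members in clusters.items()
            if other != lab
        )
        denom = max(a, b)
        scores.append((b - a) / denom if denom > 0 else 0.0)
    return sum(scores) / len(scores)


# Toy categorical rows (hypothetical data) with two hand-made labelings.
rows = [
    ("red", "small", "round"),
    ("red", "small", "oval"),
    ("blue", "large", "square"),
    ("blue", "large", "oval"),
]
good = silhouette(rows, [0, 0, 1, 1], mismatches)
bad = silhouette(rows, [0, 1, 0, 1], mismatches)
print(good, bad)
```

Choosing the number of clusters would then mean repeating this for each candidate k and comparing the mean widths, with the labels coming from whichever algorithm is used.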
Thanks