Kmeans using categorical variables

Asked 12/12, 2019 at 18:9 Answered 12/12, 2019 at 22:14

python machine-learning scikit-learn data-science unsupervised-learning

I have a large dataset of 45421 * 12 (rows * columns) in which every variable is categorical; there are no numerical variables. I would like to build an unsupervised clustering model on it, but before modeling I would like to know the best feature selection method for this kind of data. I am also unable to plot an elbow curve for it: I run the k-means elbow method over k = 1 to 1000, but it does not produce a plot with a clear optimal number of clusters, and it takes 8-10 hours to execute. Any suggestion for a better approach would be a great help.

Code:

import pandas as pd

# Every column must have the same number of values (four rows here)
data = {'UserName':['infuk_tof', 'infus_llk', 'infaus_kkn', 'infin_mdx'],
       'UserClass':['high','low','low','medium'],
       'UserCountry':['unitedkingdom','unitedstates','australia','india'],
       'UserRegion':['EMEA','EMEA','APAC','APAC'],
       'UserOrganization':['INFBLRPR','INFBLRHC','INFBLRPR','INFBLRHC'],
       'UserAccesstype':['Region','country','country','region']}

df = pd.DataFrame(data)

Benefaction answered 12/12, 2019 at 18:9 Comment(3)

Can you give an example of a few rows of your dataset? And are you using scikit-learn for K-means? – Ribonuclease 12/12, 2019 at 18:12

Yes, I am using scikit-learn for K-means. These are some rows of my dataset: data = {'UserName':['infuk_tof', 'infus_llk', 'infaus_kkn', 'infin_mdx'], 'UserClass':['high','low','low','medium'], 'UserCountry':['unitedkingdom','unitedstates','australia','india'], 'UserRegion':['EMEA','EMEA','APAC','APAC'], 'UserOrganization':['INFBLRPR','INFBLRHC','INFBLRPR','INFBLRHC'], 'UserAccesstype':['Region','country','country','region']} df = pd.DataFrame(data) – Benefaction 12/12, 2019 at 19:14

Using k-means on a strictly categorical dataset is not the best approach, because the float values that the k-means algorithm computes (the cluster means) have no real meaning for categories. I suggest you use MCA (multiple correspondence analysis) and then cluster, as described in this article. Another alternative for unsupervised clustering of categorical variables is k-modes. The author of k-modes explains the problems of k-means for categorical values in more detail. – Crofton 4/6, 2022 at 16:26

For categorical data like this, K-means is not the appropriate clustering algorithm. You may want to look at a K-modes method instead, which is unfortunately not currently included in the scikit-learn package. There is a kmodes package available on GitHub, https://github.com/nicodv/kmodes, which follows much of the syntax you're used to from scikit-learn.

For more, please see the discussion here: https://datascience.stackexchange.com/questions/22/k-means-clustering-for-mixed-numeric-and-categorical-data
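To make the idea behind K-modes concrete, here is a minimal sketch in plain Python. It is not the kmodes package's implementation or API; the function names (`hamming`, `column_modes`, `k_modes`) and the sample rows are assumptions made up for illustration. It assigns each record to the centroid with the fewest mismatching attributes and then updates each centroid to the most frequent value per column.

```python
import random
from collections import Counter

def hamming(a, b):
    """Number of attributes on which two records differ."""
    return sum(x != y for x, y in zip(a, b))

def column_modes(records):
    """Most frequent value in each column of a non-empty list of records."""
    return tuple(Counter(col).most_common(1)[0][0] for col in zip(*records))

def k_modes(records, k, max_iter=20, seed=0):
    """Toy k-modes: assign by Hamming distance, update centroids as column modes."""
    rng = random.Random(seed)
    # Start from k distinct records (assumes at least k distinct rows exist).
    modes = rng.sample(sorted(set(records)), k)
    labels = None
    for _ in range(max_iter):
        new_labels = [
            min(range(k), key=lambda j: hamming(r, modes[j])) for r in records
        ]
        if new_labels == labels:
            break  # assignments stopped changing
        labels = new_labels
        for j in range(k):
            members = [r for r, lab in zip(records, labels) if lab == j]
            if members:  # keep the previous mode if a cluster is empty
                modes[j] = column_modes(members)
    return labels, modes

# Hypothetical rows modelled on the question's columns:
# (UserClass, UserCountry, UserRegion, UserOrganization)
rows = [
    ("high", "unitedkingdom", "EMEA", "INFBLRPR"),
    ("low", "unitedkingdom", "EMEA", "INFBLRPR"),
    ("high", "unitedstates", "EMEA", "INFBLRPR"),
    ("low", "australia", "APAC", "INFBLRHC"),
    ("medium", "india", "APAC", "INFBLRHC"),
    ("low", "india", "APAC", "INFBLRHC"),
]

labels, modes = k_modes(rows, k=2)
print(labels)
print(modes)
```

For real work the kmodes package linked above is the better choice; this sketch only shows why no float "mean" of categories is needed.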

Ribonuclease answered 12/12, 2019 at 19:37 Comment(0)

-1

To run K-means or any other model, you first need to transform the categorical variables into numerical ones.

Example using OneHotEncoder:

from sklearn.preprocessing import OneHotEncoder
import pandas as pd

data={'UserAccesstype': ['Region', 'country', 'country', 'region'],
 'UserCountry': ['unitedkingdom', 'unitedstates', 'australia', 'india'],
 'UserOrganization': ['INFBLRPR', 'INFBLRHC', 'INFBLRPR', 'INFBLRHC'],
 'UserRegion': ['EMEA', 'EMEA', 'APAC', 'APAC']}

df = pd.DataFrame(data)

  UserAccesstype    UserCountry UserOrganization UserRegion
0         Region  unitedkingdom         INFBLRPR       EMEA
1        country   unitedstates         INFBLRHC       EMEA
2        country      australia         INFBLRPR       APAC
3         region          india         INFBLRHC       APAC

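# handle_unknown='ignore' encodes categories not seen during fit as all zeros
enc = OneHotEncoder(handle_unknown='ignore')
enc.fit(df.values)

# transform returns a sparse matrix; toarray() converts it to a dense array
X_for_Kmeans = enc.transform(df.values).toarray()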

X_for_Kmeans
array([[1., 0., 0., 0., 0., 1., 0., 0., 1., 0., 1.],
       [0., 1., 0., 0., 0., 0., 1., 1., 0., 0., 1.],
       [0., 1., 0., 1., 0., 0., 0., 0., 1., 1., 0.],
       [0., 0., 1., 0., 1., 0., 0., 1., 0., 1., 0.]])

Use X_for_Kmeans to fit K-means. Cheers

Lustring answered 12/12, 2019 at 22:14 Comment(1)

Just because you can do this doesn't mean that you should. There's no clearly defined metric to define a distance between data points in the categorical space, and this is an active field of research (See here, for example: link.springer.com/article/10.1007/s12652-019-01445-5) – Ribonuclease 12/12, 2019 at 22:38
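One way to see what distance K-means ends up using on one-hot data: with a full one-hot encoding, the squared Euclidean distance between two rows is exactly twice the number of attributes on which they differ. The sketch below checks this on the four rows from the answer above. The helper names (`one_hot`, `squared_euclidean`, `hamming`) are hypothetical and written in plain Python, not taken from scikit-learn.

```python
def one_hot(records):
    """One-hot encode records, with categories sorted within each column."""
    n_cols = len(records[0])
    categories = [sorted({r[c] for r in records}) for c in range(n_cols)]
    return [
        [1 if r[c] == cat else 0 for c in range(n_cols) for cat in categories[c]]
        for r in records
    ]

def squared_euclidean(u, v):
    """Squared Euclidean distance between two numeric vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))

def hamming(a, b):
    """Number of attributes on which two records differ."""
    return sum(x != y for x, y in zip(a, b))

# Same rows as the DataFrame in the answer:
# (UserAccesstype, UserCountry, UserOrganization, UserRegion)
rows = [
    ("Region", "unitedkingdom", "INFBLRPR", "EMEA"),
    ("country", "unitedstates", "INFBLRHC", "EMEA"),
    ("country", "australia", "INFBLRPR", "APAC"),
    ("region", "india", "INFBLRHC", "APAC"),
]

encoded = one_hot(rows)

for i in range(len(rows)):
    for j in range(i + 1, len(rows)):
        d2 = squared_euclidean(encoded[i], encoded[j])
        mismatches = hamming(rows[i], rows[j])
        print(i, j, d2, mismatches)  # d2 is always 2 * mismatches
```

So the encoding treats every mismatch as equally far apart ('Region' vs 'region' counts the same as 'unitedkingdom' vs 'india'), which is the kind of implicit choice the comment above is warning about.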
