sklearn categorical data clustering

Asked 13/11, 2018 at 20:52 Answered 14/11, 2018 at 7:59

Solved python scikit-learn cluster-analysis

I'm using sklearn's agglomerative clustering function. I have mixed data that includes both numeric and nominal columns. My nominal columns have values such as "Morning", "Afternoon", "Evening", and "Night". If I convert the nominal data to numeric by assigning integer values like 0, 1, 2, 3, the Euclidean distance between "Night" and "Morning" will be calculated as 3, but the distance should be 1.

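import pandas as pd
from sklearn import metrics
from sklearn.cluster import AgglomerativeClustering
from sklearn.preprocessing import StandardScaler

X = pd.read_csv("mydata.csv", sep=",", header=0, encoding="utf-8")
X = StandardScaler().fit_transform(X)
print("n_samples: %d, n_features: %d" % X.shape)

# Note: newer scikit-learn releases name this parameter `metric` instead of `affinity`.
km = AgglomerativeClustering(n_clusters=5, affinity='euclidean', linkage='average')
km.fit(X)

print("k = %d,  Silhouette Coefficient: %0.3f" % (km.n_clusters,
   metrics.silhouette_score(X, km.labels_, sample_size=None)))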

The code above is what I am using.

How can I customize the distance function in sklearn or convert my nominal data to numeric?
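
One way to approach the custom-distance half of this question is to compute the distance matrix yourself and pass it to the clusterer as precomputed. The following is only a minimal sketch: the helper `cyclic_distance_matrix`, the `ORDER` list, and the sample labels are hypothetical, and it covers a single nominal column only (for mixed data you would still need to decide how to combine this with the distances of the scaled numeric columns).

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering

# Hypothetical chronological order; the position in this list is the category code.
ORDER = ["Morning", "Afternoon", "Evening", "Night"]

def cyclic_distance_matrix(labels, order=ORDER):
    """Pairwise distance counted around the cycle, so Night-Morning is 1, not 3."""
    codes = np.array([order.index(v) for v in labels])
    diff = np.abs(codes[:, None] - codes[None, :])
    return np.minimum(diff, len(order) - diff)

labels = ["Morning", "Night", "Afternoon", "Evening", "Morning", "Night"]
D = cyclic_distance_matrix(labels).astype(float)

# The keyword is `metric` in recent scikit-learn releases and `affinity` in older ones.
try:
    model = AgglomerativeClustering(n_clusters=2, metric="precomputed", linkage="average")
except TypeError:
    model = AgglomerativeClustering(n_clusters=2, affinity="precomputed", linkage="average")

cluster_ids = model.fit_predict(D)
print(D)
print(cluster_ids)
```

Note that a precomputed matrix cannot be combined with `linkage='ward'`, which is why the sketch keeps the average linkage from the question.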

Recidivate answered 13/11, 2018 at 20:52 Comment(2)

Can you use the built-in sklearn LabelEncoder? – Cider 13/11, 2018 at 21:4

You actually want to use OneHotEncoder. – Voluptuary 14/11, 2018 at 1:15

I think you have three options for converting categorical features to numerical ones:

1. Use OneHotEncoder. It transforms the categorical feature into four new columns, with a single 1 in each row and 0 elsewhere. The problem here is that the difference between "morning" and "afternoon" is the same as the difference between "morning" and "evening".
2. Use OrdinalEncoder. It transforms the categorical feature into a single column: "morning" to 1, "afternoon" to 2, etc. The difference between "morning" and "afternoon" will be smaller than between "morning" and "evening", which is good, but the difference between "morning" and "night" will be the greatest, which might not be what you want.
3. Use a transformation that I call two_hot_encoder. It is similar to OneHotEncoder, except that there are two 1s in each row. The difference between "morning" and "afternoon" will be the same as the difference between "morning" and "night", and it will be smaller than the difference between "morning" and "evening". I think this is the best solution. Check the code below.

Code:

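import numpy as np

def two_hot(x):
    # x is a column vector of shape (n_samples, 1) holding the category names.
    # Each output column marks a pair of neighbouring times of day, so every
    # row gets exactly two 1s and adjacent categories share one of them.
    return np.concatenate([
        (x == "morning") | (x == "afternoon"),
        (x == "afternoon") | (x == "evening"),
        (x == "evening") | (x == "night"),
        (x == "night") | (x == "morning"),
    ], axis=1).astype(int)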

x = np.array([["morning", "afternoon", "evening", "night"]]).T
print(x)
x = two_hot(x)
print(x)

Output:

[['morning']
 ['afternoon']
 ['evening']
 ['night']]
[[1 0 0 1]
 [1 1 0 0]
 [0 1 1 0]
 [0 0 1 1]]
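
The hard-coded comparisons can be generalized to any cyclic list of categories. This is a sketch only: `cyclic_two_hot` is a hypothetical helper name, and it assumes each category sets its own slot plus the preceding slot (wrapping around), which is the pattern visible in the matrix above.

```python
import numpy as np

def cyclic_two_hot(values, order):
    """Encode each value with two 1s: its own slot and the preceding slot (wrapping)."""
    n = len(order)
    codes = np.array([order.index(v) for v in values])
    out = np.zeros((len(values), n), dtype=int)
    rows = np.arange(len(values))
    out[rows, codes] = 1            # the category's own slot
    out[rows, (codes - 1) % n] = 1  # the slot shared with the previous category
    return out

order = ["morning", "afternoon", "evening", "night"]
encoded = cyclic_two_hot(order, order)
print(encoded)
```

For the four times of day this gives the same rows as `two_hot`.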

Then we can measure the distances:

from sklearn.metrics.pairwise import euclidean_distances
euclidean_distances(x)

Output:

array([[0.        , 1.41421356, 2.        , 1.41421356],
       [1.41421356, 0.        , 1.41421356, 2.        ],
       [2.        , 1.41421356, 0.        , 1.41421356],
       [1.41421356, 2.        , 1.41421356, 0.        ]])
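
For comparison, the same distance check can be run on the first two options. The sketch below assumes the categories are passed explicitly in chronological order, since both encoders otherwise use sorted (alphabetical) order; the variable names are illustrative.

```python
import numpy as np
from sklearn.metrics.pairwise import euclidean_distances
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder

order = ["morning", "afternoon", "evening", "night"]
x = np.array(order).reshape(-1, 1)

# Passing `categories` fixes the column/code order instead of the sorted default.
one_hot = OneHotEncoder(categories=[order]).fit_transform(x).toarray()
ordinal = OrdinalEncoder(categories=[order]).fit_transform(x)  # codes 0, 1, 2, 3

d_one_hot = euclidean_distances(one_hot)
d_ordinal = euclidean_distances(ordinal)
print(d_one_hot.round(3))
print(d_ordinal.round(3))
```

With one-hot encoding every pair of different categories ends up equally far apart, while the ordinal codes put "morning" and "night" at the largest distance, which is the wrap-around problem from the question.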

Fawcette answered 14/11, 2018 at 7:59 Comment(2)

While chronologically morning should be closer to afternoon than to evening for example, qualitatively in the data there may not be reason to assume that that is the case. One hot encoding leaves it to the machine to calculate which categories are the most similar. I like the idea behind your two hot encoding method but it may be forcing one's own assumptions onto the data. – Nestle 14/11, 2018 at 16:40

You are right that it depends on the task. For some tasks it might be better to treat each time of day separately. But the statement "One hot encoding leaves it to the machine to calculate which categories are the most similar" is not true for clustering. Clustering forms clusters based on the distances between examples, which are computed from their features. So we should design the features so that similar examples have feature vectors that are a short distance apart. – Fawcette 15/11, 2018 at 6:21

This problem is common to machine learning applications. You need to define one category as the base category (it doesn't matter which) then define indicator variables (0 or 1) for each of the other categories. In other words, create 3 new variables called "Morning", "Afternoon", and "Evening", and assign a one to whichever category each observation has. If it's a night observation, leave each of these new variables as 0.
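
A minimal sketch of this indicator-variable approach with pandas follows. The sample values and the `time_of_day` name are made up for illustration, and "Night" is assumed to be the base category, so it is listed first and removed by `drop_first=True`.

```python
import pandas as pd

values = ["Morning", "Afternoon", "Evening", "Night", "Morning"]

# Listing "Night" first makes it the category that drop_first removes (the base).
dtype = pd.CategoricalDtype(categories=["Night", "Morning", "Afternoon", "Evening"])
time_of_day = pd.Series(values, name="time_of_day").astype(dtype)

# One 0/1 column per remaining category; a "Night" row is all zeros.
dummies = pd.get_dummies(time_of_day, drop_first=True, dtype=int)
print(dummies)
```

One thing to keep in mind for distance-based clustering: under Euclidean distance the all-zero base category sits at distance 1 from every other category, while the other categories are about 1.41 apart from each other, so the choice of base may have some effect on the clusters.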

Nestle answered 13/11, 2018 at 21:12 Comment(0)
