sklearn Hierarchical Agglomerative Clustering using similarity matrix

Given a distance matrix with similarities between various professors:

              prof1     prof2     prof3
       prof1     0        0.8     0.9
       prof2     0.8      0       0.2
       prof3     0.9      0.2     0

I need to perform hierarchical clustering on this data, where the above data is given as a 2-D matrix:

       data_matrix=[[0,0.8,0.9],[0.8,0,0.2],[0.9,0.2,0]]

I tried checking if I can implement it using sklearn.cluster.AgglomerativeClustering, but it treats the 3 rows as 3 separate feature vectors rather than as a distance matrix. Can it be done using this, or with scipy.cluster.hierarchy?

Maneating answered 16/11, 2017 at 3:34 Comment(0)

Yes, you can do it with sklearn. You need to set:

  • affinity='precomputed', to use a matrix of distances
  • linkage='complete' or 'average', because the default linkage (Ward) works only on coordinate input.

With a precomputed affinity, the input matrix is interpreted as a matrix of distances between observations. The following code

from sklearn.cluster import AgglomerativeClustering
data_matrix = [[0,0.8,0.9],[0.8,0,0.2],[0.9,0.2,0]]
model = AgglomerativeClustering(affinity='precomputed', n_clusters=2, linkage='complete').fit(data_matrix)
print(model.labels_)

will return labels [1 0 0]: the 1st professor goes to one cluster, and the 2nd and 3rd to another.
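
Note that newer scikit-learn releases (1.2 and later) rename the affinity parameter to metric; the following is a minimal sketch of the same call, assuming one of those releases:

from sklearn.cluster import AgglomerativeClustering

data_matrix = [[0, 0.8, 0.9], [0.8, 0, 0.2], [0.9, 0.2, 0]]

# 'metric' replaces 'affinity' in scikit-learn 1.2+
model = AgglomerativeClustering(metric='precomputed', n_clusters=2,
                                linkage='complete').fit(data_matrix)
print(model.labels_)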

Kev answered 16/11, 2017 at 15:16 Comment(3)
Thanks, makes more sense now. – Maneating
From: scikit-learn.org/stable/modules/generated/… : "If “precomputed”, a distance matrix (instead of a similarity matrix) is needed as input for the fit method." Does that mean that the values from the input similarity matrix should be inverted when passed to the model? – Councilman
No, you don't need to invert the matrix; just use 1 - similarity (for all i, j) when you form the matrix. – Arsenate

You can also do it with scipy.cluster.hierarchy:

from scipy.cluster.hierarchy import dendrogram, linkage, cut_tree
from scipy.spatial.distance import squareform
from matplotlib import pyplot as plt

# Data
X = [[0, 0.8, 0.9], [0.8, 0, 0.2], [0.9, 0.2, 0]]
labels = ['prof1','prof2','prof3']

# Convert the square distance matrix to condensed form, since linkage
# expects a condensed distance vector (or raw observations), then cluster;
# you can choose the method, in this case we use 'ward'
Z = linkage(squareform(X), 'ward')

# Extract the membership to a cluster, either specify the n_clusters
# or the cut height
# (similar to sklearn labels)
print(cut_tree(Z, n_clusters=2))

# Visualize the clustering as a dendrogram
fig = plt.figure(figsize=(25, 10))
dn = dendrogram(Z, orientation='right', labels=labels)
plt.show()

This will print:

[[0]
 [1]
 [1]]

Since we specified n_clusters=2, there are 2 clusters: prof1 belongs to cluster 0, and prof2 and prof3 belong to cluster 1. You could also specify a cut height instead of the number of clusters, as sketched below the figure. The dendrogram looks like this:

Dendrogram of the three professors: https://imgur.com/EF0cW4U.png
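
As mentioned, you can also cut the tree at a height instead of asking for a fixed number of clusters; a small sketch, where the 0.5 threshold is just an illustrative value:

# Cut the same linkage matrix Z at a chosen height rather than
# requesting a fixed number of clusters; with this data that should
# again separate prof1 from prof2/prof3
print(cut_tree(Z, height=0.5))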

Locative answered 2/9, 2020 at 2:22 Comment(0)

The input data_matrix here must be a distance matrix, unlike the similarity matrix given in the question; the two are opposite kinds of measures, and using one in place of the other would produce fairly arbitrary results. Check the official documentation ("If “precomputed”, a distance matrix (instead of a similarity matrix) is needed as input for the fit method."): https://scikit-learn.org/stable/modules/generated/sklearn.cluster.AgglomerativeClustering.html

As a solution, one could use distance = 1 - similarity (given the similarity values are normalized between 0 and 1) and then use that matrix as the input.
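
For instance, a minimal sketch of that conversion (the similarity values below are illustrative, with a self-similarity of 1 on the diagonal):

import numpy as np
from sklearn.cluster import AgglomerativeClustering

similarity = np.array([[1.0, 0.8, 0.9],
                       [0.8, 1.0, 0.2],
                       [0.9, 0.2, 1.0]])

# Convert similarities in [0, 1] to distances; the diagonal becomes 0 as required
distance = 1 - similarity

model = AgglomerativeClustering(affinity='precomputed', n_clusters=2,
                                linkage='complete').fit(distance)
print(model.labels_)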

I have tried this on a few examples and validated it, so it should do the job.

Rani answered 2/5, 2020 at 22:25 Comment(0)
