How to automatically get the optimal number of clusters using hierarchical cluster analysis in Python?
I want to use hierarchical cluster analysis to get the optimal number (K) of clusters automatically, then apply this K to K-means clustering in Python.

After studying many articles, I know that some methods tell us to plot a graph to determine K, but is there any method that can output the actual number automatically in Python?

Overawe asked 5/6, 2018 at 8:10. Comments (4):
Define "best k" in your caseBurletta
I believe hierarchical clustering is itself a way to cluster values, so it would make no sense to apply one clustering algorithm, find how many clusters it returns, and then apply another algorithm (K-means in your case). Correct me if I'm wrong. (Militiaman)
@Burletta It means the optimal number of clusters. I edited my question for clarity. Thanks for your suggestion. (Overawe)
"optimal" by what measure? This is not objective. And the methods for k means are just very crude heuristics, that choose a bad k as often as a good k.Brakesman

The hierarchical clustering method uses the dendrogram to determine the optimal number of clusters. Plot the dendrogram with code similar to the following:

# General imports
import numpy as np
import matplotlib.pyplot as plt

# Special imports
from scipy.cluster.hierarchy import dendrogram, linkage

# Load your data here; random points are only a stand-in so the snippet runs
X = np.random.rand(20, 2)

# Labels for the dendrogram leaves, one per data point
labelList = [str(i) for i in range(len(X))]

# How to link clusters: 'single' merges on the minimal distance between clusters
linked = linkage(X, 'single')

# Plot dendrogram
plt.figure(figsize=(10, 7))
dendrogram(linked,
           orientation='top',
           labels=labelList,
           distance_sort='descending',
           show_leaf_counts=True)
plt.show()

In the dendrogram, find the largest vertical gap that no horizontal merge line crosses, and draw a horizontal line through its middle. The number of vertical lines that line intersects is the optimal number of clusters (when affinity is calculated using the method set in linkage).

See example here: https://stackabuse.com/hierarchical-clustering-with-python-and-scikit-learn/

How to automatically read a dendrogram and extract that number is something I would also like to know.
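
One way to sketch that automation, using the linked matrix from the snippet above (this heuristic is my addition, not part of the original answer): the third column of the linkage matrix holds the merge distances, so the largest jump between successive merges marks a natural place to cut the tree.

# A minimal sketch: find the largest gap in merge distances
merge_distances = linked[:, 2]
gaps = np.diff(merge_distances)

# Cutting just below the largest jump leaves this many clusters
k = len(merge_distances) - np.argmax(gaps)
print(f"Suggested number of clusters: {k}")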

Added in edit: There is a way to do so using the scikit-learn package. See the following example:

#==========================================================================
# Hierarchical Clustering - Automatic determination of number of clusters
#==========================================================================

# General imports
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from os import path

# Special imports
import scipy.cluster.hierarchy as shc
from sklearn.cluster import AgglomerativeClustering

# %matplotlib inline

print("============================================================")
print("       Hierarchical Clustering demo - num of clusters       ")
print("============================================================")
print(" ")


folder = path.dirname(path.realpath(__file__)) # set current folder

# Load data
customer_data = pd.read_csv(path.join(folder, "hierarchical-clustering-with-python-and-scikit-learn-shopping-data.csv"))
# print(customer_data.shape)
print("In this data there should be 5 clusters...")

# Retain only the last two columns
data = customer_data.iloc[:, 3:5].values

# # Plot dendrogram using SciPy
# plt.figure(figsize=(10, 7))
# plt.title("Customer Dendrograms")
# dend = shc.dendrogram(shc.linkage(data, method='ward'))

# plt.show()


# Initialize hierarchical clustering. For the algorithm to determine the number
# of clusters itself, set n_clusters=None and compute_full_tree=True;
# the best distance threshold value for this dataset is distance_threshold=200.
# (Note: newer scikit-learn versions rename the 'affinity' parameter to 'metric'.)
cluster = AgglomerativeClustering(n_clusters=None, affinity='euclidean', linkage='ward', compute_full_tree=True, distance_threshold=200)

# Cluster the data
cluster.fit_predict(data)

print(f"Number of clusters = {1+np.amax(cluster.labels_)}")

# Display the clustering, assigning a cluster label to every data point
print("Classifying the points into clusters:")
print(cluster.labels_)

# Display the clustering graphically in a plot
plt.scatter(data[:,0],data[:,1], c=cluster.labels_, cmap='rainbow')
plt.title(f"SK Learn estimated number of clusters = {1+np.amax(cluster.labels_)}")
plt.show()

print(" ")

[Plot: the clustering results, points colored by cluster label]

The data was taken from here: https://stackabuse.s3.amazonaws.com/files/hierarchical-clustering-with-python-and-scikit-learn-shopping-data.csv
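
To connect this back to the question, the K found above can be fed straight into K-means. A minimal sketch, reusing data and cluster.labels_ from the code above (the KMeans step is my addition, not part of the original answer):

from sklearn.cluster import KMeans

# Use the number of clusters found by agglomerative clustering as K
k = 1 + np.amax(cluster.labels_)
kmeans = KMeans(n_clusters=k, n_init=10, random_state=0).fit(data)
print(f"K-means labels with K={k}:")
print(kmeans.labels_)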

Levine answered 27/10, 2020 at 15:56. Comment (1):
Great solution! I was wondering how I could automate this process! (Webbing)

I found a solution that I use in my code. It involves the color_list returned by the dendrogram, whose length counts the number of "connections"; to extract the number of "leaves" (clusters), just decrease that count by 1: https://www.youtube.com/watch?v=4DInt3H2UNE
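
A minimal sketch of what that approach might look like (my reconstruction, not the video's exact code), assuming SciPy's default coloring where all links above the color threshold share a single color:

import numpy as np
from scipy.cluster.hierarchy import dendrogram, linkage

X = np.random.rand(20, 2)  # stand-in data; replace with your own
linked = linkage(X, method='ward')

# no_plot=True: we only want the returned dictionary, not a figure
d = dendrogram(linked, no_plot=True)

# Each cluster below the color threshold gets its own color, and all links
# above the threshold share one color, so subtract 1 from the color count
n_clusters = len(set(d['color_list'])) - 1
print(f"Estimated number of clusters: {n_clusters}")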

Gove answered 12/10, 2022 at 10:08. Comment (1):
Your answer could be improved with additional supporting information. Please edit to add further details, such as citations or documentation, so that others can confirm that your answer is correct. You can find more information on how to write good answers in the help center. (Cheapskate)
