The Wikipedia article on determining the number of clusters in a dataset suggested that I do not need to worry about this problem when using hierarchical clustering. However, when I tried to use scikit-learn's agglomerative clustering, I saw that I have to pass the number of clusters as the "n_clusters" parameter, without which I get the hard-coded default of two clusters. How can I go about choosing the right number of clusters for my dataset in this case? Is the wiki article wrong?
Wikipedia is simply making an extreme simplification which has nothing to do with real life. Hierarchical clustering does not avoid the problem of choosing the number of clusters. It simply constructs a tree spanning all the samples, which shows which samples (and, later on, which clusters) merge together to form bigger clusters. This happens recursively until only two clusters remain (which is why the default number of clusters is 2), and those are finally merged into the whole dataset. You are then left with "cutting" through the tree to obtain the actual clustering. Once you fit AgglomerativeClustering you can traverse the whole tree and analyze which clusters to keep:
import numpy as np
from sklearn.cluster import AgglomerativeClustering
import itertools

# toy data: two well-separated groups (3 + 2 samples, 10 features each)
X = np.concatenate([np.random.randn(3, 10), np.random.randn(2, 10) + 100])

clustering = AgglomerativeClustering()
clustering.fit(X)

# children_ holds the pair of nodes merged at each step; internal nodes are
# numbered starting at n_samples, so use a single shared counter for their ids
ii = itertools.count(X.shape[0])
[{'node_id': next(ii), 'left': x[0], 'right': x[1]} for x in clustering.children_]
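If you would rather let the library do the "cutting" for you, scikit-learn (0.22 and later, assuming you are on a recent version) lets you pass distance_threshold instead of n_clusters, so the tree is cut at a merge distance you choose rather than at a fixed count. A minimal sketch, with a threshold picked arbitrarily for this toy data:

import numpy as np
from sklearn.cluster import AgglomerativeClustering

X = np.concatenate([np.random.randn(3, 10), np.random.randn(2, 10) + 100])

# n_clusters=None plus distance_threshold cuts the dendrogram at a merge
# distance instead of forcing a predefined number of clusters
clustering = AgglomerativeClustering(n_clusters=None, distance_threshold=50.0)
labels = clustering.fit_predict(X)

print(labels)                  # e.g. [1 1 1 0 0] for this well-separated data
print(clustering.n_clusters_)  # number of clusters implied by the cut

The threshold value 50.0 is only illustrative; in practice you would inspect the merge distances (or the dendrogram) of your own data to decide where to cut.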
ELKI (not scikit-learn, but Java) has a number of advanced methods that extract clusters from a hierarchical clustering. They are smarter than just cutting the tree at a particular height; for example, they can produce a hierarchy of clusters with a minimum size.
You could check if these methods work for you.
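If you prefer to stay in Python, SciPy's hierarchy module offers some of the same flexibility: fcluster can flatten a dendrogram by criteria other than a fixed cluster count. A small sketch (the thresholds here are arbitrary and chosen only for this toy data):

import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

X = np.concatenate([np.random.randn(3, 10), np.random.randn(2, 10) + 100])

Z = linkage(X, method='ward')   # full merge tree (dendrogram)

# cut by merge distance instead of a fixed number of clusters
labels_by_distance = fcluster(Z, t=50.0, criterion='distance')

# or cut by how inconsistent each merge is relative to its subtree
labels_by_inconsistency = fcluster(Z, t=1.0, criterion='inconsistent')

print(labels_by_distance, labels_by_inconsistency)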