How to set a minimum number of observations per cluster in k-means clustering?

Asked 1/5, 2019 at 0:51 Answered 14/5, 2021 at 15:58

pandas machine-learning scikit-learn cluster-analysis k-means

I am trying to cluster some products based on users' behavior. The clusters I end up with have very different numbers of observations.

I have checked k-means clustering parameters and was not able to find a parameter that controls the minimum (or maximum) number of observations per cluster.

For example, here is how the observations are distributed across the clusters:

cluster_id   num_observations
0   6
1   4
2   1
3   3
4   29
5   5

How can I deal with this issue?

Pharisaism answered 1/5, 2019 at 0:51 Comment(3)

How are you calculating the clusters? By the definition of k-means, if you put a limit on the number of observations you can have in each group, your results will be biased and could be incorrect, especially if you plan on using the model on real data – Wavemeter 1/5, 2019 at 1:14

This might be a good sign that you should select fewer clusters for your KMeans! – Wandie 1/5, 2019 at 2:4

I'm not sure why you'd want to do this, and if you do, it's not k-means clustering, but here's a thought: Do k-means clustering, then, for clusters below the size minimum, find the nearest neighbor to the cluster center that is NOT already in the cluster, and move it there. Repeat. I don't know, however, how to interpret what that would really mean. – Propagandist 1/5, 2019 at 2:15
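The heuristic in the last comment can be sketched roughly as follows. This is only an illustration in plain Python, not a library API: the function name enforce_min_size is made up, the centers are held fixed, and points are only taken from clusters that have more than size_min members (an added assumption, so that a donor cluster never drops below the minimum). The labels and centers are assumed to come from any ordinary k-means run, and the result is no longer a true k-means solution.

```python
import math


def enforce_min_size(points, labels, centers, size_min):
    """Greedily grow clusters that have fewer than size_min members.

    Hypothetical helper: repeatedly moves the point nearest to an
    undersized cluster's (fixed) center into that cluster.
    """
    labels = list(labels)
    k = len(centers)
    if size_min * k > len(points):
        raise ValueError("size_min * n_clusters exceeds the number of points")
    sizes = [labels.count(c) for c in range(k)]
    while True:
        undersized = [c for c in range(k) if sizes[c] < size_min]
        if not undersized:
            return labels
        target = undersized[0]
        # Only take points from clusters that can spare one (assumption).
        candidates = [
            i for i, lab in enumerate(labels)
            if lab != target and sizes[lab] > size_min
        ]
        nearest = min(
            candidates,
            key=lambda i: math.dist(points[i], centers[target]),
        )
        sizes[labels[nearest]] -= 1
        labels[nearest] = target
        sizes[target] += 1


# Toy data: four points near the origin and one outlier in its own cluster.
points = [(0, 0), (0, 1), (1, 0), (1, 1), (10, 10)]
labels = [0, 0, 0, 0, 1]
centers = [(0.5, 0.5), (10, 10)]
balanced = enforce_min_size(points, labels, centers, size_min=2)
print(balanced)
```

In the toy data, the point (1, 1) is the closest outside point to the second center, so it is the one that gets moved. Whether such a forced assignment is meaningful is exactly the concern raised in the comment.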

For those who are still looking for an answer: I found a good module that deals with this kind of problem.

Install it with pip install size-constrained-clustering or pip install git+https://github.com/jingw2/size_constrained_clustering.git, then use MinMaxKMeansMinCostFlow, which lets you set size_min and size_max:

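import numpy as np
from size_constrained_clustering import minmax

n_samples = 2000
n_clusters = 3
X = np.random.rand(n_samples, 2)  # random 2-D points

# Each cluster must contain between size_min and size_max points
model = minmax.MinMaxKMeansMinCostFlow(n_clusters, size_min=400, size_max=800)
model.fit(X)
centers = model.cluster_centers_
labels = model.labels_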

Clifford answered 18/9, 2020 at 6:12 Comment(0)

This can be solved with the k-means-constrained library, which is installable with pip (pip install k-means-constrained).

Example:

>>> from k_means_constrained import KMeansConstrained
>>> import numpy as np
>>> X = np.array([[1, 2], [1, 4], [1, 0],
...                [4, 2], [4, 4], [4, 0]])
>>> clf = KMeansConstrained(
...     n_clusters=2,
...     size_min=2,
...     size_max=5,
...     random_state=0
... )
>>> clf.fit_predict(X)
array([0, 0, 0, 1, 1, 1], dtype=int32)
>>> clf.cluster_centers_
array([[ 1.,  2.],
       [ 4.,  2.]])
>>> clf.labels_
array([0, 0, 0, 1, 1, 1], dtype=int32)
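Whichever library you use, it may be worth checking that the returned labels actually respect the bounds you asked for. A minimal sketch in plain Python; the helper names cluster_sizes and satisfies_bounds are made up for illustration and are not part of either package. The labels below are copied from the output shown above.

```python
from collections import Counter


def cluster_sizes(labels, n_clusters):
    """Return the number of observations in each cluster, including empty ones."""
    counts = Counter(int(lab) for lab in labels)
    return [counts.get(c, 0) for c in range(n_clusters)]


def satisfies_bounds(labels, n_clusters, size_min, size_max):
    """True if every cluster size lies within [size_min, size_max]."""
    return all(
        size_min <= size <= size_max
        for size in cluster_sizes(labels, n_clusters)
    )


labels = [0, 0, 0, 1, 1, 1]  # labels from the example above
sizes = cluster_sizes(labels, n_clusters=2)
ok = satisfies_bounds(labels, n_clusters=2, size_min=2, size_max=5)
print(sizes, ok)
```

Because each label is converted with int(), the same helpers should also accept a NumPy array such as clf.labels_.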

Chaiken answered 14/5, 2021 at 15:58 Comment(0)
