I have recently authored a new hierarchical clustering algorithm specifically for 1D data. I think it would be suitable to the case in your question. It is called hlust1d
and it is written in R
. It is available under this link.
The algorithm addresses @Anony-Mousse's concern: it takes advantages of the particular structure of 1D data and runs in O(n*log n) time. This is much faster than general-purpose hierarchical clustering algorithms.
To segment your data into 3 bins (clusters) as you require in your question, you could run the following code
library(hclust1d)
dendrogram <- hclust1d(c(1, 1, 2, 3, 10, 11, 13, 67, 71))
plot(dendrogram)
Now, having a look at the dendrogram tree one can see that cutting at the height of aprox. 10 results in the segmentation required.
cutree(dendrogram, h=10)
# 1 1 2 3 10 11 13 67 71
# 1 1 1 1 2 2 2 3 3
The clustering is hierarchical, meaning that what you get is a hierarchy of clusters (a dendrogram) and it is up to you to decide, which number of clusters fits your particular data best. It is both an advantage (flexibility) and disadvantage (you have to decide something) of this method. For instance, another good cut of the dendrogram, as can be seen on a plot, is located somewhere between the heights of 20 and 60, resulting in two clusters:
cutree(dendrogram, h=20)
# 1 1 2 3 10 11 13 67 71
# 1 1 1 1 1 1 1 2 2
For more complex data, you could also experiment with other linkage functions using the method
argument, like this:
dendrogram <- hclust1d(c(1, 1, 2, 3, 10, 11, 13, 67, 71), method="ward.D2")
Generally, Ward linkage method would give you a clustering similar to K-Means (the loss function in both methods is the same with Ward hierarchical clustering being a greedy implementation) but the hierarchical clustering let's you decide what is an appropriate number of clusters with a dendrogram in front of you, while you have to provide that number for K-Means up front.
The list of all supported linkage methods can be read with a use of
> supported_methods()
# [1] "complete" "average" "centroid" "true_median" "median" "mcquitty" "ward.D" "ward.D2" "single"