1D Number Array Clustering

So let's say I have an array like this:

[1,1,2,3,10,11,13,67,71]

Is there a convenient way to partition the array into something like this?

[[1,1,2,3],[10,11,13],[67,71]]

I looked through similar questions, but most people suggested using k-means (e.g. via scipy), which is quite confusing for a beginner like me. Also, I think k-means is more suited to clustering in two or more dimensions, right? Is there a way to partition an array of N numbers into clusters based on the numbers themselves?

Some people also suggest rigid range partitioning, but it doesn't always produce the expected results.

Maidy asked 16/7, 2012 at 22:25 Comment(0)

Don't use multidimensional clustering algorithms for a one-dimensional problem. A single dimension is much more special than you naively think, because you can actually sort it, which makes things a lot easier.

In fact, it is usually not even called clustering, but e.g. segmentation or natural breaks optimization.

You might want to look at Jenks Natural Breaks Optimization and similar statistical methods. Kernel Density Estimation (KDE) is also a good method to look at, with a strong statistical background. Local minima in the density are good places to split the data into clusters, and there are statistical reasons for doing so. KDE is probably the most sound method for clustering 1-dimensional data.

With KDE, it again becomes obvious that 1-dimensional data is much better behaved. In 1D, you have local minima; but in 2D you may have saddle points and similar "maybe" splitting points. See the Wikipedia illustration of a saddle point for an example of how such a point may or may not be appropriate for splitting clusters.

See this answer for an example of how to do this in Python (green markers are the cluster modes; red markers are points where the data is cut; the y-axis is the log-likelihood of the density):

KDE with Python
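As a minimal sketch of the KDE approach (not the linked answer's exact code; the bandwidth bw_method=0.1 and the grid resolution are choices you would have to tune for your own data):

import numpy as np
from scipy.signal import argrelextrema
from scipy.stats import gaussian_kde

data = np.array([1, 1, 2, 3, 10, 11, 13, 67, 71])

# Estimate the density on a fine grid; the bandwidth is deliberately small here.
kde = gaussian_kde(data, bw_method=0.1)
grid = np.linspace(data.min() - 1, data.max() + 1, 1000)
density = kde(grid)

# Local minima of the estimated density are the split points.
cuts = grid[argrelextrema(density, np.less)[0]]

# Assign each value to the interval between consecutive split points.
edges = np.r_[-np.inf, cuts, np.inf]
clusters = [data[(data >= lo) & (data < hi)].tolist()
            for lo, hi in zip(edges[:-1], edges[1:])]
print([c for c in clusters if c])
# [[1, 1, 2, 3], [10, 11, 13], [67, 71]] with this bandwidth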

Katmandu answered 17/7, 2012 at 5:38 Comment(7)
Implementation here: macwright.org/2013/02/18/literate-jenks.html – Eyeshade
Could you update your answer with why meanshift or dbscan may or may not be good approaches to clustering 1D? See scikit-learn.org/stable/modules/clustering.html – Diphthongize
Essentially, both are very naive approximations to Kernel Density Estimation. Mean-Shift is a mode-seeking approach for multivariate KDE, and DBSCAN uses the most primitive KDE (a box kernel) to define what is dense and what is not. There is zero benefit to using them on 1-dimensional data. – Katmandu
Ckmeans.1d.dp (k-means adapted for 1-dimensional clustering) is worth a look, however. See journal.r-project.org/archive/2011-2/… – Mlawsky
@Mlawsky that is a slower k-means variant that yields the global optimum (in 1D only). But if the SSQ k-means objective doesn't solve your problem, it does not matter whether you find a 0.1% better (by SSQ) k-means solution than with the faster standard algorithm. – Katmandu
Is there a Java implementation available? – Floweret
Your interpretation of clustering is very particular and narrow. Not all 1D data sets have distributions like the one you've plotted. Your suggested methods will struggle to cluster heavy-tailed data sets, e.g. scree plots of an eigenvalue decomposition. – Joselow

This simple algorithm works:

points = [0.1, 0.31, 0.32, 0.45, 0.35, 0.40, 0.5]

clusters = []
eps = 0.2
points_sorted = sorted(points)
curr_point = points_sorted[0]
curr_cluster = [curr_point]
for point in points_sorted[1:]:
    # Start a new cluster whenever the gap to the previous point exceeds eps.
    if point <= curr_point + eps:
        curr_cluster.append(point)
    else:
        clusters.append(curr_cluster)
        curr_cluster = [point]
    curr_point = point
clusters.append(curr_cluster)
print(clusters)

The example above groups the points such that each element in a group is at most eps away from another element in the same group. This is essentially DBSCAN with eps=0.2, min_samples=1. As others noted, 1D data lets you solve the problem directly instead of reaching for bigger guns like DBSCAN.

On the small datasets (<1000 elements) I tested, the loop above was roughly 10-100x faster than DBSCAN.
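As a quick cross-check of that equivalence, here is a minimal sketch using scikit-learn's DBSCAN (assuming scikit-learn is installed); with min_samples=1 every point is a core point, so the labels induce the same grouping:

import numpy as np
from sklearn.cluster import DBSCAN

points = [0.1, 0.31, 0.32, 0.45, 0.35, 0.40, 0.5]

# DBSCAN expects a 2D array, so reshape the sorted 1D data into a single column.
X = np.array(sorted(points)).reshape(-1, 1)
labels = DBSCAN(eps=0.2, min_samples=1).fit_predict(X)

# Group the sorted points by their DBSCAN label.
print([X[labels == k, 0].tolist() for k in np.unique(labels)])
# [[0.1], [0.31, 0.32, 0.35, 0.4, 0.45, 0.5]]  (same grouping as the loop above)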

Mayramays answered 6/7, 2021 at 13:2 Comment(0)

You may want to look at discretization algorithms. The 1D discretization problem is very similar to what you are asking: such algorithms decide cut-off points according to frequency, a binning strategy, etc.

Weka uses the following algorithms in its discretization process:

weka.filters.supervised.attribute.Discretize

uses either Fayyad & Irani's MDL method or Kononenko's MDL criterion

weka.filters.unsupervised.attribute.Discretize

uses simple binning
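As a rough illustration of the unsupervised variant (this is plain NumPy, not Weka), simple equal-width binning looks like the sketch below; the number of bins is something you would have to choose:

import numpy as np

nums = np.array([1, 1, 2, 3, 10, 11, 13, 67, 71])

# Equal-width binning: split the value range into n_bins intervals
# and group the values by the interval they fall into.
n_bins = 3
edges = np.linspace(nums.min(), nums.max(), n_bins + 1)
bin_ids = np.digitize(nums, edges[1:-1])  # interior edges only

print([nums[bin_ids == b].tolist() for b in np.unique(bin_ids)])
# [[1, 1, 2, 3, 10, 11, 13], [67, 71]]  (equal-width bins ignore the gaps,
# which is why rigid range partitioning often disappoints, as the question notes)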

Cayenne answered 18/7, 2012 at 10:14 Comment(1)
Hi! The link doesn't seem accessible anymore... do you have another resource, please? – Smackdab

CKwrap is a fast and straightforward k-means clustering function, though a bit light on documentation.

Example Usage

pip install ckwrap

import ckwrap
import numpy as np

nums = np.array([1, 1, 2, 3, 10, 11, 13, 67, 71])
km = ckwrap.ckmeans(nums, 3)

print(km.labels)
# [0 0 0 0 1 1 1 2 2]

buckets = [[], [], []]
for i in range(len(nums)):
    buckets[km.labels[i]].append(nums[i])
print(buckets)
# [[1, 1, 2, 3], [10, 11, 13], [67, 71]]

I expect the authors intended you to make use of the ndarray functionality rather than create a list of lists.
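For instance, a minimal sketch of that idea using boolean masks (assuming km.labels is a NumPy array, as the printed output above suggests):

# Group nums by cluster label using boolean masks instead of Python lists.
buckets = [nums[km.labels == k] for k in range(km.k)]
print([b.tolist() for b in buckets])
# [[1, 1, 2, 3], [10, 11, 13], [67, 71]]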

other measures:

km.centers
km.k
km.sizes
km.totss
km.betweenss
km.withinss

The underlying algorithm is based on this article.

Giarla answered 20/5, 2021 at 1:30 Comment(2)
Any idea how I get the index of the km.centers in the input dataset? – Papyraceous
km.centers[0] corresponds to the first element in the input dataset. – Giarla

I have recently authored a new hierarchical clustering algorithm specifically for 1D data. I think it would be suitable for the case in your question. It is called hclust1d and it is written in R. It is available under this link.

The algorithm addresses @Anony-Mousse's concern: it takes advantage of the particular structure of 1D data and runs in O(n log n) time, which is much faster than general-purpose hierarchical clustering algorithms.

To segment your data into 3 bins (clusters) as you require in your question, you could run the following code

library(hclust1d)
dendrogram <- hclust1d(c(1, 1, 2, 3, 10, 11, 13, 67, 71))
plot(dendrogram)

(dendrogram plot produced by plot(dendrogram))

Now, looking at the dendrogram, one can see that cutting at a height of approximately 10 results in the required segmentation:

cutree(dendrogram, h=10)
# 1  1  2  3 10 11 13 67 71 
# 1  1  1  1  2  2  2  3  3 

The clustering is hierarchical, meaning that what you get is a hierarchy of clusters (a dendrogram), and it is up to you to decide which number of clusters fits your particular data best. This is both an advantage (flexibility) and a disadvantage (you have to decide something) of the method. For instance, another good cut of the dendrogram, as can be seen on the plot, lies somewhere between heights 20 and 60, resulting in two clusters:

cutree(dendrogram, h=20)
# 1  1  2  3 10 11 13 67 71 
# 1  1  1  1  1  1  1  2  2 

For more complex data, you could also experiment with other linkage functions using the method argument, like this:

dendrogram <- hclust1d(c(1, 1, 2, 3, 10, 11, 13, 67, 71), method="ward.D2")

Generally, the Ward linkage method gives you a clustering similar to k-means (the loss function is the same in both methods, with Ward hierarchical clustering being a greedy implementation of it), but hierarchical clustering lets you decide on an appropriate number of clusters with the dendrogram in front of you, whereas you have to provide that number to k-means up front.

The list of all supported linkage methods can be obtained with:

> supported_methods()
# [1] "complete"    "average"     "centroid"    "true_median" "median"      "mcquitty"    "ward.D"      "ward.D2"     "single" 
Intercommunicate answered 27/3 at 13:54 Comment(1)
Well, this is interesting, and I would like to know if there is a Python implementation of the above? – Keeshakeeshond

A late response, just for the record: you can partition a 1D array using Ckmeans.1d.dp.

This method guarantees optimality and runs in O(n^2) time, where n is the number of observations. The implementation is in C++ and there is a wrapper in R.
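For intuition, here is a rough Python sketch of the underlying dynamic program, a naive O(k*n^2) version rather than the package's optimized C++ implementation (the helper name optimal_1d_kmeans is made up). After sorting, every optimal cluster is a contiguous segment, so the best partition can be built up segment by segment:

import numpy as np

def optimal_1d_kmeans(values, k):
    # Sort: in 1D, every optimal cluster is a contiguous block of sorted values.
    x = np.sort(np.asarray(values, dtype=float))
    n = len(x)

    # Prefix sums so the within-cluster SSQ of any segment is O(1).
    s1 = np.concatenate(([0.0], np.cumsum(x)))
    s2 = np.concatenate(([0.0], np.cumsum(x * x)))

    def ssq(j, i):
        # Sum of squared deviations from the mean for x[j..i] (inclusive).
        m = i - j + 1
        seg = s1[i + 1] - s1[j]
        return (s2[i + 1] - s2[j]) - seg * seg / m

    INF = float("inf")
    # cost[c][i]: best total SSQ for x[0..i] split into c+1 clusters.
    cost = [[INF] * n for _ in range(k)]
    back = [[0] * n for _ in range(k)]
    for i in range(n):
        cost[0][i] = ssq(0, i)
    for c in range(1, k):
        for i in range(c, n):
            for j in range(c, i + 1):       # the last cluster is x[j..i]
                cand = cost[c - 1][j - 1] + ssq(j, i)
                if cand < cost[c][i]:
                    cost[c][i] = cand
                    back[c][i] = j

    # Walk the back-pointers to recover the cluster boundaries.
    clusters, i = [], n - 1
    for c in range(k - 1, -1, -1):
        j = back[c][i] if c > 0 else 0
        clusters.append(x[j:i + 1].tolist())
        i = j - 1
    return clusters[::-1]

print(optimal_1d_kmeans([1, 1, 2, 3, 10, 11, 13, 67, 71], 3))
# [[1.0, 1.0, 2.0, 3.0], [10.0, 11.0, 13.0], [67.0, 71.0]]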

Jockstrap answered 28/12, 2021 at 18:39 Comment(0)

Code for Has QUIT--Anony-Mousse's answer to "Clustering values by their proximity in python (machine learning?)":

When you have 1-dimensional data, sort it, and look for the largest gaps

I only added that the gaps should be large relative to the values (the code splits at local maxima of the relative gap).

import numpy as np
from scipy.signal import argrelextrema
# lst = [1,1,5,6,1,5,10,22,23,23,50,51,51,52,100,112,130,500,512,600,12000,12230]
lst = [1,1,2,3,10,11,13,67,71]
lst.sort()
# Gap between consecutive sorted values, then that gap relative to the preceding value.
diff = [lst[i] - lst[i-1] for i in range(1, len(lst))]
rel_diff = [diff[i]/lst[i] for i in range(len(diff))]
# Split wherever the relative gap is a local maximum.
arg = argrelextrema(np.array(rel_diff), np.greater)[0]
last = 0
for x in arg:
    print(f'{last}:{x + 1} {lst[last:x + 1]}')
    last = x + 1
print(f'{last}: {lst[last:]}')

output:

0:2 [1, 1]
2:4 [2, 3]
4:7 [10, 11, 13]
7: [67, 71]
Functionalism answered 25/9, 2022 at 17:20 Comment(1)
I prefer iterating over list elements to indexing (e.g. [y - x for x, y in zip(lst[:-1], lst[1:])]), but algorithmically a very pragmatic answer. In particular, the complexity is the same as for the sort, so O(n log n). – Rennie
