How would one use Kernel Density Estimation as a 1D clustering method in scikit learn?
I need to cluster a simple univariate data set into a preset number of clusters. Technically it would be closer to binning or sorting the data since it is only 1D, but my boss is calling it clustering, so I'm going to stick to that name. The current method used by the system I'm on is K-means, but that seems like overkill.

Is there a better way of performing this task?

Answers to some other posts mention KDE (Kernel Density Estimation), but that is a density estimation method. How would that work?

I see how KDE returns a density, but how do I tell it to split the data into bins?

How do I get a fixed number of bins, independent of the data (that's one of my requirements)?

More specifically, how would one pull this off using scikit learn?

My input file looks like:

 str ID     sls
 1           10
 2           11 
 3            9
 4           23
 5           21
 6           11  
 7           45
 8           20
 9           11
 10          12

I want to group the sls number into clusters or bins, such that:

Cluster 1: [10 11 9 11 11 12] 
Cluster 2: [23 21 20] 
Cluster 3: [45] 

And my output file will look like:

 str ID     sls    Cluster ID  Cluster centroid
    1        10       1               10.66
    2        11       1               10.66
    3         9       1               10.66 
    4        23       2               21.33   
    5        21       2               21.33
    6        11       1               10.66
    7        45       3               45
    8        20       2               21.33
    9        11       1               10.66 
    10       12       1               10.66
Telmatelo answered 29/1, 2016 at 21:35 Comment(8)
What is the concern with k-means? Performance?Throwaway
kmeans is more efficient than kdeDiphenylamine
@DavidMaust 1) When I tried running sklearn's k-means on univariate data, I started getting errors. I had to trick it by having it cluster on 2d data which was identical copies of the original 1d data. 2) According to this post it's a bad idea.Telmatelo
@Diphenylamine see my reply to David Maust.Telmatelo
Have you tried writing some code?Saturation
If you have a fixed number of bins, isn't this the same as just generating a histogram of the data and specifying the number or size of the bins? You might check out numpy.histogram docs.scipy.org/doc/numpy-1.10.1/reference/generated/….Flatfoot
Histogram would have a fixed bin width, I need the bin width to be variable.Telmatelo
I had the same task, also with a 1-D array, and I just used the pandas qcut() - pandas.pydata.org/pandas-docs/stable/reference/api/…Desirae

Write code yourself. Then it fits your problem best!

Boilerplate: Never assume code you download from the net to be correct or optimal... make sure to fully understand it before using it.

%matplotlib inline

import numpy as np
from numpy import array, linspace
from sklearn.neighbors import KernelDensity
from matplotlib.pyplot import plot

a = array([10,11,9,23,21,11,45,20,11,12]).reshape(-1, 1)
kde = KernelDensity(kernel='gaussian', bandwidth=3).fit(a)
s = linspace(0,50)                        # evaluation grid covering the data range
e = kde.score_samples(s.reshape(-1,1))    # log-density at each grid point
plot(s, e)

[plot of the estimated density curve over s]

from scipy.signal import argrelextrema

mi, ma = argrelextrema(e, np.less)[0], argrelextrema(e, np.greater)[0]  # indices of local minima / maxima
print("Minima:", s[mi])
print("Maxima:", s[ma])
> Minima: [ 17.34693878  33.67346939]
> Maxima: [ 10.20408163  21.42857143  44.89795918]

Your clusters therefore are

print(a[a < mi[0]], a[(a >= mi[0]) * (a <= mi[1])], a[a >= mi[1]])
> [10 11  9 11 11 12] [23 21 20] [45]

and visually, we did this split:

plot(s[:mi[0]+1], e[:mi[0]+1], 'r',
     s[mi[0]:mi[1]+1], e[mi[0]:mi[1]+1], 'g',
     s[mi[1]:], e[mi[1]:], 'b',
     s[ma], e[ma], 'go',
     s[mi], e[mi], 'ro')

[plot of the density split into three colored segments, with maxima (green) and minima (red) marked]

We cut at the red markers. The green markers are our best estimates for the cluster centers.
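
To also get the per-row cluster ID and centroid that the question asks for, here is a minimal follow-up sketch (not part of the original code; it assumes the a, s and mi defined above, uses the x-positions s[mi] of the minima as cut points, and assumes every interval contains at least one point):

import numpy as np

cuts = s[mi]                                  # x-values of the density minima
labels = np.digitize(a.ravel(), cuts)         # cluster index (0, 1, 2, ...) per value
centroids = [a.ravel()[labels == k].mean() for k in range(len(cuts) + 1)]

for value, label in zip(a.ravel(), labels):
    print(value, label + 1, round(centroids[label], 2))

Each line prints the value, its 1-based cluster ID and the mean of its cluster, matching the desired output columns.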

Saturation answered 2/2, 2016 at 11:16 Comment(9)
I would be hesitant to call this method better than k-means. It does involve selecting an arbitrary bandwidth and then calculating 50 density estimates. That being said, I don't know if there is a better way to do it with kernel density estimation.Throwaway
You don't have to know k. You not only get better centers (less affected by outliers) but also sound splitting points (not just halfway between centers). There is plenty of literature on bandwidth selection, such as Silverman's rule. Also, who cares about computing 50 density estimates? You could precompute the kernel and do this as a fast convolution.Saturation
I will also add that this is a particularly fast, non-linear scaling method to 1D clustering.Bagatelle
hi I have posted a question about this answer, could you pls help me about it? #60355997Diarrhea
There is a slight error in this accepted answer (I couldn't comment on it earlier due to my reputation). See my answer below.Leund
would you mind re-rendering your plots without transparency in the axis? they're hard to see in darkmodePloughshare
I was getting the error: ModuleNotFoundError: No module named 'sklearn.neighbors.kde'. In version 0.24.2 at least, it seems from sklearn.neighbors.kde import KernelDensity must be replaced by from sklearn.neighbors import KernelDensityBandstand
@HasQUIT--Anony-Mousse I do care about the 50 density estimates. I have 50,000 tests running in parallel, probably 500,000 times a day. So I do care, and I think your answer is sub-optimal time-wise.Gibson
Please, may I know why we use s = linspace(0,50)?Gallinacean

There is a little error in the accepted answer by @Has QUIT--Anony-Mousse (I can't comment nor suggest an edit due to my reputation).

The line:

print(a[a < mi[0]], a[(a >= mi[0]) * (a <= mi[1])], a[a >= mi[1]])

Should be edited into:

print(a[a < s[mi][0]], a[(a >= s[mi][0]) * (a <= s[mi][1])], a[a >= s[mi][1]])

That's because mi and ma hold indices into s, whereas s[mi] and s[ma] are the corresponding values. If you use mi[0] as the limit, you risk a wrong split whenever the range of your linspace is much wider than the range of your data. For example, run this code and see the difference in the split result:

import numpy as np
from numpy import array, linspace
from sklearn.neighbors import KernelDensity
from matplotlib.pyplot import plot
from scipy.signal import argrelextrema

a = array([10,11,9,23,21,11,45,20,11,12]).reshape(-1, 1)
kde = KernelDensity(kernel='gaussian', bandwidth=3).fit(a)
s = linspace(0,100)
e = kde.score_samples(s.reshape(-1,1))
mi, ma = argrelextrema(e, np.less)[0], argrelextrema(e, np.greater)[0]

print('Grouping by Has QUIT:')
print(a[a < mi[0]], a[(a >= mi[0]) * (a <= mi[1])], a[a >= mi[1]])
print('Grouping by yasirroni:')
print(a[a < s[mi][0]], a[(a >= s[mi][0]) * (a < s[mi][1])], a[a >= s[mi][1]])

result:

Grouping by Has QUIT:
[] [10 11  9 11 11 12] [23 21 45 20]
Grouping by yasirroni:
[10 11  9 11 11 12] [23 21 20] [45]
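
A quick way to see the distinction (a small check assuming the variables from the code above): mi holds grid positions, s[mi] holds the actual x-values of the minima.

print(mi)      # indices into the sampling grid s
print(s[mi])   # the x-values where the density has its local minima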
Leund answered 9/10, 2020 at 0:7 Comment(7)
@Yassirroni - nice answer. But for this, do we have to normalize/standardize our input data, or can I pass the raw data as is?Potence
Or is that not required since we are doing each variable one by one?Potence
@TheGreat you can adjust the KDE settings on KernelDensity. Anyway, I suggest you look at the Python package called jenkspy first.Leund
Sure, but can I check whether we have to standardize/normalize our data for 1d variable clustering?Potence
This is the related post - stats.stackexchange.com/questions/576982/…Potence
what is the purpose of s here ?Aha
@Aha the x-axis, where we sample the density.Leund

Further improving the response above by @yasirroni, to dynamically print all clusters (not just the 3 above), the line:

print(a[a < s[mi][0]], a[(a >= s[mi][0]) * (a <= s[mi][1])], a[a >= s[mi][1]])

can be changed into:

print(a[a < s[mi][0]])  # print most left cluster

# print all middle cluster
for i_cluster in range(len(mi)-1):
    print(a[(a >= s[mi][i_cluster]) * (a <= s[mi][i_cluster+1])])

print(a[a >= s[mi][-1]])  # print most right cluster

This would ensure that all the clusters are taken into account.
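
An equivalent way to collect the clusters into a list instead of printing them, sketched under the same assumptions (the a, s and mi from the answers above), is to pad the cut points with infinities and mask once per interval:

import numpy as np

cuts = np.concatenate(([-np.inf], s[mi], [np.inf]))   # cut points padded with infinities
clusters = [a[(a >= lo) * (a < hi)] for lo, hi in zip(cuts[:-1], cuts[1:])]
print(clusters)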

Stepparent answered 22/6, 2022 at 18:9 Comment(0)
