How would one use Kernel Density Estimation as a 1D clustering method in scikit learn?
I need to cluster a simple univariate data set into a preset number of clusters. Technically it would be closer to binning or sorting the data since it is only 1D, but my boss is calling it clustering, so I'm going to stick to that name. The current method used by the system I'm on is K-means, but that seems like overkill.

Is there a better way of performing this task?

Answers to some other posts mention KDE (Kernel Density Estimation), but that is a density estimation method. How would that work?

I see how KDE returns a density, but how do I tell it to split the data into bins?

How do I get a fixed number of bins, independent of the data (that's one of my requirements)?

More specifically, how would one pull this off using scikit learn?

My input file looks like:

 str ID     sls
 1           10
 2           11 
 3            9
 4           23
 5           21
 6           11  
 7           45
 8           20
 9           11
 10          12

I want to group the sls number into clusters or bins, such that:

Cluster 1: [10 11 9 11 11 12] 
Cluster 2: [23 21 20] 
Cluster 3: [45] 

And my output file will look like:

 str ID     sls    Cluster ID  Cluster centroid
    1        10       1               10.66
    2        11       1               10.66
    3         9       1               10.66 
    4        23       2               21.33   
    5        21       2               21.33
    6        11       1               10.66
    7        45       3               45
    8        20       2               21.33
    9        11       1               10.66 
    10       12       1               10.66
Telmatelo answered 29/1, 2016 at 21:35 Comment(8)
What is the concern with k-means? Performance?Throwaway
kmeans is more efficient than kdeDiphenylamine
@DavidMaust 1) When I tried running sklearn's k-means on univariate data, I started getting errors. I had to trick it by having it cluster on 2d data which was identical copies of the original 1d data. 2) According to this post it's a bad idea.Telmatelo
@Diphenylamine see my reply to David Maust.Telmatelo
Have you tried writing some code?Saturation
If you have a fixed number of bins, isn't this the same as just generating a histogram of the data and specifying the number or size of the bins? You might check out numpy.histogram docs.scipy.org/doc/numpy-1.10.1/reference/generated/….Flatfoot
Histogram would have a fixed bin width, I need the bin width to be variable.Telmatelo
I had the same task, also with a 1-D array, and I just used the pandas qcut() - pandas.pydata.org/pandas-docs/stable/reference/api/…Desirae

Write code yourself. Then it fits your problem best!

Boilerplate: Never assume code you download from the net to be correct or optimal... make sure to fully understand it before using it.

%matplotlib inline

import numpy as np
from numpy import array, linspace
from sklearn.neighbors import KernelDensity
from matplotlib.pyplot import plot

a = array([10,11,9,23,21,11,45,20,11,12]).reshape(-1, 1)
kde = KernelDensity(kernel='gaussian', bandwidth=3).fit(a)
s = linspace(0,50)                        # evaluation grid covering the data range
e = kde.score_samples(s.reshape(-1,1))    # log-density at each grid point
plot(s, e)

[plot of the estimated density curve over s]

from scipy.signal import argrelextrema

mi, ma = argrelextrema(e, np.less)[0], argrelextrema(e, np.greater)[0]  # indices of local minima / maxima
print("Minima:", s[mi])
print("Maxima:", s[ma])
> Minima: [ 17.34693878  33.67346939]
> Maxima: [ 10.20408163  21.42857143  44.89795918]

Your clusters therefore are

print(a[a < mi[0]], a[(a >= mi[0]) * (a <= mi[1])], a[a >= mi[1]])
> [10 11  9 11 11 12] [23 21 20] [45]

and visually, we did this split:

plot(s[:mi[0]+1], e[:mi[0]+1], 'r',
     s[mi[0]:mi[1]+1], e[mi[0]:mi[1]+1], 'g',
     s[mi[1]:], e[mi[1]:], 'b',
     s[ma], e[ma], 'go',
     s[mi], e[mi], 'ro')

[plot of the density split into three colored segments, with maxima (green) and minima (red) marked]

We cut at the red markers. The green markers are our best estimates for the cluster centers.
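
To also get the per-row cluster ID and centroid that the question asks for, here is a minimal follow-up sketch (not part of the original code; it assumes the a, s and mi defined above, uses the x-positions s[mi] of the minima as cut points, and assumes every interval contains at least one point):

import numpy as np

cuts = s[mi]                                  # x-values of the density minima
labels = np.digitize(a.ravel(), cuts)         # cluster index (0, 1, 2, ...) per value
centroids = [a.ravel()[labels == k].mean() for k in range(len(cuts) + 1)]

for value, label in zip(a.ravel(), labels):
    print(value, label + 1, round(centroids[label], 2))

Each line prints the value, its 1-based cluster ID and the mean of its cluster, matching the desired output columns.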

Saturation answered 2/2, 2016 at 11:16 Comment(9)
I would be hesitant to call this method better than k-means. It does involve selecting an arbitrary bandwidth and then calculating 50 density estimates. That being said, I don't know if there is a better way to do it with kernel density estimation.Throwaway
You don't have to know k. You not only get better centers (less affected by outliers) but also sound splitting points (not just halfway between centers). There is plenty of literature on bandwidth selection, such as Silverman's rule. Also, who cares about computing 50 density estimates? You could precompute the kernel and do this as a fast convolution.Saturation
I will also add that this is a particularly fast, non-linear scaling method to 1D clustering.Bagatelle
hi I have posted a question about this answer, could you pls help me about it? #60355997Diarrhea
There is a slight error in this accepted answer (I couldn't comment on it earlier due to my reputation). See my answer below.Leund
would you mind re-rendering your plots without transparency in the axis? they're hard to see in darkmodePloughshare
I was getting the error: ModuleNotFoundError: No module named 'sklearn.neighbors.kde'. In version 0.24.2 at least, it seems from sklearn.neighbors.kde import KernelDensity must be replaced by from sklearn.neighbors import KernelDensityBandstand
@HasQUIT--Anony-Mousse I do care about the 50 density estimates. I have 50,000 tests running in parallel, probably 500,000 times a day. So I do care, and I think your answer is sub-optimal time-wise.Gibson
Please, may I know why we use s = linspace(0,50)?Gallinacean

There is a little error in the accepted answer by @Has QUIT--Anony-Mousse (I can't comment nor suggest an edit due to my reputation).

The line:

print(a[a < mi[0]], a[(a >= mi[0]) * (a <= mi[1])], a[a >= mi[1]])

Should be edited into:

print(a[a < s[mi][0]], a[(a >= s[mi][0]) * (a <= s[mi][1])], a[a >= s[mi][1]])

That's because mi and ma hold indices into s, whereas s[mi] and s[ma] are the corresponding values. If you use mi[0] as the limit, you risk a wrong split whenever the range of your linspace is much wider than the range of your data. For example, run this code and see the difference in the split result:

import numpy as np
from numpy import array, linspace
from sklearn.neighbors import KernelDensity
from matplotlib.pyplot import plot
from scipy.signal import argrelextrema

a = array([10,11,9,23,21,11,45,20,11,12]).reshape(-1, 1)
kde = KernelDensity(kernel='gaussian', bandwidth=3).fit(a)
s = linspace(0,100)
e = kde.score_samples(s.reshape(-1,1))
mi, ma = argrelextrema(e, np.less)[0], argrelextrema(e, np.greater)[0]

print('Grouping by Has QUIT:')
print(a[a < mi[0]], a[(a >= mi[0]) * (a <= mi[1])], a[a >= mi[1]])
print('Grouping by yasirroni:')
print(a[a < s[mi][0]], a[(a >= s[mi][0]) * (a < s[mi][1])], a[a >= s[mi][1]])

result:

Grouping by Has QUIT:
[] [10 11  9 11 11 12] [23 21 45 20]
Grouping by yasirroni:
[10 11  9 11 11 12] [23 21 20] [45]
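
A quick way to see the distinction (a small check assuming the variables from the code above): mi holds grid positions, s[mi] holds the actual x-values of the minima.

print(mi)      # indices into the sampling grid s
print(s[mi])   # the x-values where the density has its local minima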
Leund answered 9/10, 2020 at 0:7 Comment(7)
@Yassirroni - nice answer. But for this, do we have to normalize/standardize our input data, or can I pass the raw data as is?Potence
Or is that not required since we are doing each variable one by one?Potence
@TheGreat you can adjust the KDE settings on KernelDensity. Anyway, I suggest you look at the Python package called jenkspy first.Leund
Sure, but can I check whether we have to standardize/normalize our data for 1d variable clustering?Potence
This is the related post - stats.stackexchange.com/questions/576982/…Potence
what is the purpose of s here ?Aha
@Aha the x-axis, where we sample the density.Leund

Further improving the response above by @yasirroni, to dynamically print all clusters (not just the 3 above), the line:

print(a[a < s[mi][0]], a[(a >= s[mi][0]) * (a <= s[mi][1])], a[a >= s[mi][1]])

can be changed into:

print(a[a < s[mi][0]])  # print most left cluster

# print all middle cluster
for i_cluster in range(len(mi)-1):
    print(a[(a >= s[mi][i_cluster]) * (a <= s[mi][i_cluster+1])])

print(a[a >= s[mi][-1]])  # print most right cluster

This would ensure that all the clusters are taken into account.
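
An equivalent way to collect the clusters into a list instead of printing them, sketched under the same assumptions (the a, s and mi from the answers above), is to pad the cut points with infinities and mask once per interval:

import numpy as np

cuts = np.concatenate(([-np.inf], s[mi], [np.inf]))   # cut points padded with infinities
clusters = [a[(a >= lo) * (a < hi)] for lo, hi in zip(cuts[:-1], cuts[1:])]
print(clusters)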

Stepparent answered 22/6, 2022 at 18:9 Comment(0)
