Choosing bandwidth and linspace for kernel density estimation (why doesn't my bandwidth work?)

I have followed this link for the application of kernel density estimation. My aim is to create two or more groups/clusters for each array in a group of arrays. The code below works for every member of the array group except this array:

X = np.array([[77788], [77793],[77798], [77803], [92886], [92891], [92896], [92901]])

So my expectation is seeing two different clusters such as:

first_group = ([[77788], [77793],[77798], [77803]])

second_group = ([[92886], [92891], [92896], [92901]])

I have a dynamic list, so I cannot fix the values passed to linspace: the array may range from 0 to 10 or from 100000 to 2000000. That's why I pass the min and max points of the array to linspace.

Even so, I could not obtain separate clusters, although I tried various bandwidths. My code is below:

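import numpy as np
from matplotlib.pyplot import plot
from scipy.signal import argrelextrema
from sklearn.neighbors import KernelDensity

a = X.reshape(-1,1)
kde = KernelDensity(kernel='gaussian', bandwidth=8).fit(a)
s = np.linspace(a.min(), a.max())  # 50 evaluation points by default
e = kde.score_samples(s.reshape(-1,1))  # log-density at each grid point
plot(s, e)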

[Plot of the log-density e over s for bandwidth=8]

mi, ma = argrelextrema(e, np.less)[0], argrelextrema(e, np.greater)[0]
print("Minima:", s[mi])  # output: []
print("Maxima:", s[ma])  # output: []

Both s[mi] and s[ma] are empty, which would mean there are no two distinct clusters in this array. Yet the plot shows at least one minimum. Why does this value not appear in the s[mi] output?

I ran the same code with the different bandwidths shown below, but there are still no minimum or maximum values for this array. Any idea what I am doing wrong?

[Plot for bandwidth = 0.008]

[Plot for bandwidth = 0.00002]

Livy answered 22/2, 2020 at 18:39

You may consider trying a grid search over the bandwidth:

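import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KernelDensity

# candidate bandwidths, log-spaced between 0.1 and 10
params = {'bandwidth': np.logspace(-1, 1, 20)}
grid   = GridSearchCV(KernelDensity(), params)
grid.fit(a)

print("best bandwidth: {0}".format(grid.best_estimator_.bandwidth))

kde = grid.best_estimator_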
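
Since the data range in the question is dynamic, one option is to scale the candidate bandwidths to the data instead of fixing them between 0.1 and 10. A minimal sketch, assuming the sample standard deviation is a reasonable yardstick; the 1% to 100% range is an illustrative choice of mine, not something from the question:

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KernelDensity

X = np.array([[77788], [77793], [77798], [77803],
              [92886], [92891], [92896], [92901]])
a = X.reshape(-1, 1).astype(float)

# Hypothetical choice: candidates from 1% to 100% of the sample
# standard deviation, so the grid follows the scale of the data.
candidates = a.std(ddof=1) * np.logspace(-2, 0, 20)

# KernelDensity.score is the total log-likelihood, which GridSearchCV
# uses as the default scoring here (default 5-fold cross-validation).
grid = GridSearchCV(KernelDensity(kernel="gaussian"), {"bandwidth": candidates})
grid.fit(a)

best_bandwidth = grid.best_params_["bandwidth"]
print("best bandwidth:", best_bandwidth)
```

The cross-validated likelihood picks a bandwidth that fits the density well, which is not necessarily the one that gives the cleanest split, so it is worth checking the resulting minima as well.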
Jayjaycee answered 23/7, 2020 at 13:57

Try a bandwidth of 10000, or try relying on heuristics for choosing the bandwidth.

To make your code more robust, also split clusters at runs of consecutive minima. Your problem is that there is no unique minimum here, but an interval of them.
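
Strict comparisons such as np.less skip flat stretches, because no point inside a plateau is strictly lower than both of its neighbours. Here is a sketch of one way to handle this: treat each run of equal values as a single candidate and cut at its centre. The helper name plateau_minima and the toy values are made up for illustration, and exact equality between neighbouring values is assumed:

```python
import numpy as np

def plateau_minima(y):
    """Return one index per local minimum, using the centre of flat runs."""
    y = np.asarray(y, dtype=float)
    # Start index of every run of equal consecutive values.
    starts = np.flatnonzero(np.r_[True, y[1:] != y[:-1]])
    if len(starts) < 3:
        # Fewer than three runs: no run has a neighbour on both sides.
        return np.array([], dtype=int)
    ends = np.r_[starts[1:], len(y)] - 1  # inclusive end of every run
    vals = y[starts]                      # one value per run
    # A run is a minimum if it is strictly below the runs on both sides.
    inner = (vals[1:-1] < vals[:-2]) & (vals[1:-1] < vals[2:])
    is_min = np.concatenate(([False], inner, [False]))
    return (starts[is_min] + ends[is_min]) // 2

# Toy curve with a flat bottom (illustrative values, not real KDE output).
s = np.arange(7.0)
e = np.array([5.0, 3.0, 1.0, 1.0, 1.0, 4.0, 6.0])

idx = plateau_minima(e)
cuts = s[idx]

# Assign toy points to clusters by which side of the cut they fall on.
points = np.array([0.5, 1.0, 5.5, 6.0])
labels = np.digitize(points, cuts)
print("cut positions:", cuts)
print("labels:", labels)
```

With real KDE output, the same helper can replace the argrelextrema(e, np.less) call, and s[idx] then gives the split points.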

Underexposure answered 24/2, 2020 at 9:29
Comment (Livy): Sorry, I don't understand what I should do to make my code more robust. Could you give me an example? @Has QUIT--Anony-Mousse
