Kernel Density Estimation using scipy's gaussian_kde and sklearn's KernelDensity leads to different results
Asked Answered
B

3

6

I created some data from two superposed normal distributions and then applied sklearn.neighbors.KernelDensity and scipy.stats.gaussian_kde to estimate the density function. However, using the same bandwith (1.0) and the same kernel, both methods produce a different outcome. Can someone explain me the reason for this? Thanks for help.

Below you can find the code to reproduce the issue:

import matplotlib.pyplot as plt
import numpy as np
from scipy.stats import gaussian_kde
import seaborn as sns
from sklearn.neighbors import KernelDensity

n = 10000
dist_frac = 0.1
x1 = np.random.normal(-5,2,int(n*dist_frac))
x2 = np.random.normal(5,3,int(n*(1-dist_frac)))
x = np.concatenate((x1,x2))
np.random.shuffle(x)
eval_points = np.linspace(np.min(x), np.max(x))

kde_sk = KernelDensity(bandwidth=1.0, kernel='gaussian')
kde_sk.fit(x.reshape([-1,1]))
y_sk = np.exp(kde_sk.score_samples(eval_points.reshape(-1,1)))

kde_sp = gaussian_kde(x, bw_method=1.0)
y_sp = kde_sp.pdf(eval_points)

sns.kdeplot(x)
plt.plot(eval_points, y_sk)
plt.plot(eval_points, y_sp)
plt.legend(['seaborn','scikit','scipy'])

scipy and scikit with bandwith=1.0

If I change the scipy bandwith to 0.25, the result of both methods look approximately the same.

scipy bandwidth=0.25 and scikit bandwith=1.0

Bolling answered 15/7, 2021 at 15:11 Comment(0)
C
6

What is meant by bandwidth in scipy.stats.gaussian_kde and sklearn.neighbors.KernelDensity is not the same. Scipy.stats.gaussian_kde uses a bandwidth factor https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.gaussian_kde.html. For a 1-D kernel density estimation the following formula is applied:

the bandwidth of sklearn.neighbors.KernelDensity = bandwidth factor of the scipy.stats.gaussian_kde * standard deviation of the sample

For your estimation this probably means that your standard deviation equals 4.

I would like to refer to Getting bandwidth used by SciPy's gaussian_kde function for more information.

Catalinacatalo answered 16/7, 2021 at 11:8 Comment(0)
C
1

To be honest, I don't know why, but using scipy hyperparameter bw_method='scott' makes it work exactly the same as seaborn.

So, it seems to be all about the hyperparameters. We could find out why by understanding them in depth, but in the meantime just use ‘scott’ or ‘silverman’ instead of using a random scalar.

import matplotlib.pyplot as plt
import numpy as np
from scipy.stats import gaussian_kde
import seaborn as sns
from sklearn.neighbors import KernelDensity

n = 10000
dist_frac = 0.1
x1 = np.random.normal(-5,2,int(n*dist_frac))
x2 = np.random.normal(5,3,int(n*(1-dist_frac)))
x = np.concatenate((x1,x2))
np.random.shuffle(x)
eval_points = np.linspace(np.min(x), np.max(x))

kde_sk = KernelDensity(bandwidth=1, kernel='gaussian')
kde_sk.fit(x.reshape([-1,1]))
y_sk = np.exp(kde_sk.score_samples(eval_points.reshape(-1,1)))

kde_sp = gaussian_kde(x, bw_method='scott')     ### I MEAN HERE! ###
y_sp = kde_sp.pdf(eval_points)

sns.kdeplot(x)
plt.plot(eval_points, y_sk)
plt.plot(eval_points, y_sp)
plt.legend(['seaborn','scikit','scipy'])
Coroner answered 7/11, 2022 at 15:28 Comment(1)
By default seaborn uses scipy.stats.gaussian_kde(bw_method='scott', ...), which is why your results match.Azzieb
E
-3

Increase the size of 'random normal'. your data points are too few. try with n=500000 and check the results.

Emblazonry answered 18/1, 2022 at 19:52 Comment(1)
Your answer could be improved with additional supporting information. Please edit to add further details, such as citations or documentation, so that others can confirm that your answer is correct. You can find more information on how to write good answers in the help center.Misfit

© 2022 - 2024 — McMap. All rights reserved.