Getting bandwidth used by SciPy's gaussian_kde function
Asked Answered
T

4

5

I'm using SciPy's stats.gaussian_kde function to generate a kernel density estimate (kde) function from a data set of x,y points.

This is a simple MWE of my code:

import numpy as np
from scipy import stats

def random_data(N):
    # Generate some random data.
    return np.random.uniform(0., 10., N)

# Data lists.
x_data = random_data(100)
y_data = random_data(100)

# Obtain the gaussian kernel.
kernel = stats.gaussian_kde(np.vstack([x_data, y_data]))

Since I'm not setting a bandwidth manually (via the bw_method key), the function defaults to using Scott's rule (see function's description). What I need is to obtain this bandwidth value set automatically by the stats.gaussian_kde function.

I've tried using:

print kernel.set_bandwidth()

but it always returns None instead of a float.

Twofaced answered 13/5, 2014 at 11:52 Comment(0)
A
17

Short answer

The bandwidth is kernel.covariance_factor() multiplied by the std of the sample that you are using.

(This is in the case of 1D sample and it is computed using Scott's rule of thumb in the default case).

Example:

from scipy.stats import gaussian_kde
sample = np.random.normal(0., 2., 100)
kde = gaussian_kde(sample)
f = kde.covariance_factor()
bw = f * sample.std()

The pdf that you get is this:

from pylab import plot
x_grid = np.linspace(-6, 6, 200)
plot(x_grid, kde.evaluate(x_grid))

enter image description here

You can check it this way, If you use a new function to create a kde using, say, sklearn:

from sklearn.neighbors import KernelDensity
def kde_sklearn(x, x_grid, bandwidth):
    kde_skl = KernelDensity(bandwidth=bandwidth)
    kde_skl.fit(x[:, np.newaxis])
    # score_samples() returns the log-likelihood of the samples
    log_pdf = kde_skl.score_samples(x_grid[:, np.newaxis])
    pdf = np.exp(log_pdf)
    return pdf

Now using the same code from above you get:

plot(x_grid, kde_sklearn(sample, x_grid, f))

enter image description here

plot(x_grid, kde_sklearn(sample, x_grid, bw))

enter image description here

Audi answered 8/3, 2016 at 17:55 Comment(6)
see my updated comment. what you are using is not the bandwidth, it is the factor to be multiplied with the std of the sampleAudi
Are you aware that the arguments bw_method in scipy.stats.gaussian_kde and bandwidth in sklearn.neighbors.KernelDensity do not represent the same thing? Please see this question: https://mcmap.net/q/1492099/-relation-between-2d-kde-bandwidth-in-sklearn-vs-bandwidth-in-scipy/1391441Twofaced
You were asking for the bandwidth, weren't you? covariance_factor is not the bandwidthAudi
I call it 'bandwidth' because the docs call bw_method "The method used to calculate the estimator bandwidth". You are probably right statistically speaking, although I'm not sure what your answer is supposed to do. Are you saying my accepted answer is not correct?Twofaced
Well it depends what you ask. If you ask how to get the factor, then your answer is right. But if you want to know what the "bandwidth" you'll have to use what I wrote. "Bandwidth" is a technical term for a parameter used by kernel-density-estimators (you can check it here en.wikipedia.org/wiki/Kernel_density_estimation). If you give bw-method a scalar, it is used to calculate the bandwidth in the manner written above (it is multiplied by the std), but it is not the bandwidth itself.Audi
Thanks so much, I was missing the multiplication by the STD and this is not at all clear from the Scipy documentation.Outwork
T
1

I've got it, the line is:

kernel.covariance_factor()

From scipy.stats.gaussian_kde.covariance_factor:

Computes the coefficient (kde.factor) that multiplies the data covariance matrix to obtain the kernel covariance matrix. The default is scotts_factor. A subclass can overwrite this method to provide a different method, or set it through a call to kde.set_bandwidth.

One can check that the resulting kernel using this bandwidth value is equivalent to the kernel generated using the default bandwidth. To do this obtain a new kernel with the bandwidth given by covariance_factor(), and compare its value on a random point with the original kernel:

kernel = stats.gaussian_kde(np.vstack([x_data, y_data]))
print kernel([0.5, 1.3])

bw = kernel.covariance_factor()    
kernel2 = stats.gaussian_kde(np.vstack([x_data, y_data]), bw_method=bw)
print kernel2([0.5, 1.3])
Twofaced answered 13/5, 2014 at 12:12 Comment(1)
bw_method parameter is used as the factor, therefore setting the factor will give you the same results. but this is not the "bandwidth". I added an answerAudi
A
1

Just another small comment to add to the already very excellent posted answer by @nivniv.

I was trying to reproduce scipy.stats.gaussian_kde from scratch, but using a stupidly small number of samples (something probably not very realistic) because I wanted to get a better sense of trying to understand how KDE works. My implementation was not lining up with scipy's at first, but then I computed the standard deviation with a degree of freedom of 1, and that made the difference.

# Implementation from scratch
import numpy as np
np.random.seed(12)
Xgrid = np.linspace(-6, 6, 200)
sample = np.random.rand(3) # only 3 samples

scottFactor = len(sample)**(-1/5)
bw = scottFactor * sample.std(ddof=1)
pde_manual = 0.
for s in sample:
    _in = (s - Xgrid) / bw
    pde_manual = pde_manual + _kernel(_in) / bw
pde_manual = pde_manual / len(sample)

To see what I mean, here's the constructing the kde using scipy,

# Scipy
from scipy.stats import gaussian_kde
k_scipy = gaussian_kde(sample, bw_method="scott")
pde_scipy = np.reshape(k_scipy(Xgrid).T, Xgrid.shape)

and plotting them ...

import matplotlib.pyplot as plt
fig, ax = plt.subplots(1, 2, figsize=(11, 5), sharey=True)
ax[0].plot(Xgrid, pde_manual)
ax[0].set_title("Implementation from scratch")

ax[1].plot(Xgrid, pde_scipy)
ax[1].set_title("scipy.stats.gaussian_kde")

enter image description here

and just to show what it looks like if ddof=1 is not set (ddof=0 is the default according to the documentation from NumPy) ...

enter image description here

However, the ddof would not matter much anymore with more samples.

Aramanta answered 8/3 at 10:53 Comment(0)
S
0

I came across this old question since I was also interested in knowing what was the bandwidth used for gaussian_kde from Scipy. I would like to add/amend previous answers that the covariance factor is used in kde.py code from Scipy as: self.covariance = self._data_covariance * self.factor**2

Therefore the full kernel covariance is the sample covariance times the square of the so-called covariance factor (Scott factor) which can be retrieved by kde.factor or kde.covariance_factor().

Slabber answered 31/12, 2018 at 12:5 Comment(0)

© 2022 - 2024 — McMap. All rights reserved.