Different integration results using Monte Carlo vs scipy.integrate.nquad

The MWE below shows two ways of integrating the same 2D kernel density estimate, obtained for this data using the stats.gaussian_kde() function.

The integration is performed for all (x, y) below the threshold point (x1, y1), which defines the upper integration limits (lower integration limits are -infinity; see MWE).
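
In other words, the intended quantity is the cumulative probability

P(x < x1, y < y1) = ∫_{-∞}^{x1} ∫_{-∞}^{y1} f(x, y) dy dx,

where f is the kernel density estimate.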

  • The int1 function uses a simple Monte Carlo approach.
  • The int2 function uses the scipy.integrate.nquad function.

The issue is that int1 (i.e., the Monte Carlo method) gives systematically larger values for the integral than int2, and I don't know why.

Here's an example of the integral values obtained after 200 runs of int1 (blue histogram) versus the integral result given by int2 (red vertical line):

[image: histogram of the 200 int1 values (blue) with the int2 result marked as a red vertical line]

What is the origin of this difference in the resulting integral value?


MWE

import numpy as np
import matplotlib.pyplot as plt
from scipy import stats
from scipy import integrate


def int1(kernel, x1, y1):
    # Compute the point below which to integrate
    iso = kernel((x1, y1))

    # Sample KDE distribution
    sample = kernel.resample(size=50000)

    # Filter the sample
    insample = kernel(sample) < iso

    # The integral is equivalent to the probability of drawing a
    # point that gets through the filter
    integral = insample.sum() / float(insample.shape[0])

    return integral


def int2(kernel, x1, y1):

    def f_kde(x, y):
        return kernel((x, y))

    # 2D integration in (-inf, x1) x (-inf, y1). Note that nquad
    # returns a (value, error_estimate) tuple.
    integral = integrate.nquad(f_kde, [[-np.inf, x1], [-np.inf, y1]])

    return integral


# Obtain data from file.
data = np.loadtxt('data.dat', unpack=True)
# Perform a kernel density estimate (KDE) on the data
kernel = stats.gaussian_kde(data)

# Define the threshold point that determines the integration limits.
x1, y1 = 2.5, 1.5

i2 = int2(kernel, x1, y1)
print(i2)

int1_vals = []
for _ in range(200):
    i = int1(kernel, x1, y1)
    int1_vals.append(i)
    print(i)

Edit

Note that this question originated from this answer. At first I missed that the answer used the wrong integration limits, which explains why the results of int1 and int2 differ.

int1 integrates over the domain f(x,y) < f(x1,y1) (where f is the kernel density estimate), while int2 integrates over the domain x < x1, y < y1.
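
A quick way to see that these are genuinely different domains is to estimate both probabilities from the same KDE resample. A minimal sketch (it uses synthetic normal data in place of data.dat, so the numbers are illustrative only):

import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
data = rng.normal(size=(2, 1000))  # stand-in for data.dat
kernel = stats.gaussian_kde(data)
x1, y1 = 2.5, 1.5

sample = kernel.resample(size=50000)
iso = kernel((x1, y1))

# Domain of int1: points whose density is below the density at (x1, y1)
p_iso = np.mean(kernel(sample) < iso)
# Domain of int2: points with x < x1 and y < y1
p_rect = np.mean((sample[0] < x1) & (sample[1] < y1))

print(p_iso, p_rect)  # two different probabilities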

Trump answered 9/3, 2016 at 21:02

You resample the distribution

sample = kernel.resample(size=50000)

and then test whether the density at each sampled point is less than the density at the bound

insample = kernel(sample) < iso

This is incorrect. Consider the bounds (x1, y1) = (0, 100) and assume your data has mean μ = (0, 0) and cov = [[100, 0], [0, 100]]. The points (0, 50) and (50, 0) have the same density under this kernel, but only the first lies inside the bounds. Since both pass the test, you are overcounting.

You should instead test whether each point in sample lies inside the bounds, and then compute the fraction that does. Something like

def int1(kernel, x1, y1):
    # Sample KDE distribution
    sample = kernel.resample(size=100)

    # Keep the points whose x AND y both fall below the bounds
    include = (sample < np.repeat([[x1], [y1]], sample.shape[1], axis=1)).all(axis=0)
    # The integral is the fraction of sampled points inside the bounds
    integral = include.sum() / float(sample.shape[1])
    return integral
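
Note that this estimator is a sample proportion, so its standard error is roughly sqrt(p(1-p)/N); with size=100 individual runs will scatter noticeably, and a larger resample (like the 50000 used in the question) tightens the histogram.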

I tested this using the following code

import numpy as np
import scipy.stats

def measure(n):
    m1 = np.random.normal(size=n)
    m2 = np.random.normal(size=n)
    return m1, m2

a = scipy.stats.gaussian_kde(np.vstack(measure(1000)))
print(int1(a, -10, -10))
print(int2(a, -10, -10))
print(int1(a, 0, 0))
print(int2(a, -0, -0))

Yields

0.0
(4.304674927251112e-232, 4.6980863813551415e-230)
0.26
(0.25897626178338407, 1.4536217446381293e-08)
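
As an extra sanity check (my addition, not part of the original answer): since measure() draws two independent standard normals, the exact rectangle probability is available from the bivariate normal CDF, assuming a SciPy version that provides multivariate_normal.cdf:

from scipy.stats import multivariate_normal

# Exact P(X < 0, Y < 0) for independent standard normals; the KDE-based
# estimates above should land near this, up to sampling noise and the
# KDE bandwidth.
exact = multivariate_normal(mean=[0, 0], cov=[[1, 0], [0, 1]]).cdf([0, 0])
print(exact)  # 0.25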

Monte Carlo integration should work like this:

  • Sample N random values (uniformly, not from your distribution) over some subset of the possible values of x/y (below I bound it by 10 SDs from the mean).
  • For each random value, compute kernel(rand_x, rand_y).
  • Sum these values and multiply by (volume) / N_samples.

In code:

def mc_wo_sample(kernel, x1, y1, lboundx, lboundy):
    nsamples = 50000
    # Area of the rectangular integration region
    volume = (x1 - lboundx) * (y1 - lboundy)
    # Generate uniform points in the range
    xrand = np.random.rand(nsamples, 1) * (x1 - lboundx) + lboundx
    yrand = np.random.rand(nsamples, 1) * (y1 - lboundy) + lboundy
    randvals = np.hstack((xrand, yrand)).transpose()
    # Mean density over the region times its area
    return (volume * kernel(randvals).sum()) / nsamples
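
One way to choose lboundx and lboundy (my assumption; the answer leaves the choice open) is to go several standard deviations below each coordinate's mean. This ignores the covariance between the coordinates, but for a well-behaved KDE the density that far out is negligible:

# Hypothetical bound choice: 10 standard deviations below each mean,
# reusing the data array from the question's MWE.
lboundx = data[0].mean() - 10 * data[0].std()
lboundy = data[1].mean() - 10 * data[1].std()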

Running the following

print(int1(a, -9, -9))
print(int2(a, -9, -9))
print(mc_wo_sample(a, -9, -9, -10, -10))
print(int1(a, 0, 0))
print(int2(a, -0, -0))
print(mc_wo_sample(a, 0, 0, -10, -10))

yields

0.0
(4.012958496109042e-70, 6.7211236076277e-71)
4.08538890986e-70
0.36
(0.37101621760650216, 1.4670898180664756e-08)
0.361614657674
Clarenceclarenceux answered 9/3, 2016 at 22:12
I thought so. The above code fails with: ValueError: operands could not be broadcast together with shapes (2,50000) (2,2). Have you tested it? Can you make it run? – Trump
Change sample.shape[0] to sample.shape[1]. That value should be the number of samples. I'm translating to your example using my own test code. – Clarenceclarenceux
Thanks dfb. It compiles now, but the results of the MC method with your code are ~0.06. This is very different from the nquad result of ~0.194. The MC method in my question gives much closer values for the integral. – Trump
See my edit - we don't want to sum the probabilities, we just test whether we hit inside the box or not on the samples. – Clarenceclarenceux
Not only does this answer solve the issue in my question, it is also many times faster. Thank you very much! You should remove the iso = kernel((x1, y1)) line in your int1 function, it is not used. Thanks again! – Trump
@Trump - After a brief look, I'm not sure the previous question works correctly. Monte Carlo integration requires you to sample random points uniformly - in my int1 function, this is handled by kernel.resample. I've written a function above that does MC w/o that function, hope that helps some. – Clarenceclarenceux
How do you obtain lboundx, lboundy? Something like this perhaps: lboundx = np.mean(data[0]) + 10*np.std(data[0])? In the tests I just made, your int1 function returns values very close to nquad, while your new function does not. – Trump
That's not quite right because you have to use the covariance, but it's probably a decent approximation. It does work with my example above, but I'll try with your data later. – Clarenceclarenceux
One other quick thought - try increasing n_samples by 10x or so. – Clarenceclarenceux
I'm not using the data file anymore; I use the random values generated with the measure() function to test all integration methods. As I said, int1 works perfectly (i.e., the results match the nquad value) but your new function is off by a lot. I increased the sample by 10x, no changes. – Trump
Hmm.. I've pasted the parameters I'm using. – Clarenceclarenceux
Let us continue this discussion in chat. – Clarenceclarenceux
