Plotting confidence intervals for Maximum Likelihood Estimate

Asked 30/1, 2014 at 10:9 Answered 1/2, 2014 at 19:37

python numpy statistics scipy statsmodels

I am trying to write code to produce confidence intervals for the number of different books in a library (as well as produce an informative plot).

My cousin is at elementary school and every week is given a book by his teacher. He then reads it and returns it in time to get another one the next week. After a while we started noticing that he was getting books he had read before and this became gradually more common over time.

Say the true number of books in the library is N and the teacher picks one uniformly at random (with replacement) to give to you each week. If at week t the number of occasions on which you have received a book you have read is x, then I can produce a maximum likelihood estimate for the number of books in the library following https://math.stackexchange.com/questions/615464/how-many-books-are-in-a-library .

Example: Consider a library with five books A, B, C, D, and E. If you receive books [A, B, A, C, B, B, D] in seven successive weeks, then the value for x (the number of duplicates) will be [0, 0, 1, 1, 2, 3, 3] after each of those weeks, meaning after seven weeks, you have received a book you have already read on three occasions.

To visualise the likelihood function (assuming I have understood what one is correctly) I have written the following code which I believe plots the likelihood function. The maximum is around 135 which is indeed the maximum likelihood estimate according to the MSE link above.

from __future__ import division
import random
import matplotlib.pyplot as plt
import numpy as np

#N is the true number of books. t is the number of weeks.unk is the true number of repeats found 
t = 30
unk = 3
def numberrepeats(N, t):
    return t - len(set([random.randint(0,N) for i in xrange(t)]))

iters = 1000
ydata = []
for N in xrange(10,500):
    sampledunk = [numberrepeats(N,t) for i in xrange(iters)].count(unk)
    ydata.append(sampledunk/iters)

print "MLE is", np.argmax(ydata)
xdata = range(10, 500)
print len(xdata), len(ydata)
plt.plot(xdata,ydata)
plt.show()

The output looks like

enter image description here

My questions are these:

Is there an easy way to get a 95% confidence interval and plot it on the diagram?
How can you superimpose a smoothed curve over the plot?
Is there a better way my code should have been written? It isn't very elegant and is also quite slow.

Finding the 95% confidence interval means finding the range of the x axis so that 95% of the time the empirical maximum likelihood estimate we get by sampling (which should theoretically be 135 in this example) will fall within it. The answer @mbatchkarov has given does not currently do this correctly.

There is now a mathematical answer at https://math.stackexchange.com/questions/656101/how-to-find-a-confidence-interval-for-a-maximum-likelihood-estimate .

Noellenoellyn answered 30/1, 2014 at 10:9 Comment(5)

You should set a seed in numberrepeats so the same random sample is used for each N. – Lighthouse 30/1, 2014 at 15:58

python-for-signal-processing.blogspot.co.uk/2012/10/… looks relevant although I haven't managed to work through it yet. – Noellenoellyn 30/1, 2014 at 20:8

The problem that your question describes is very different than the one you actually implement in your script. It was only by following the link to your math.stackexchange.com question that I was able to find out what you really meant. You should consider rewriting your question to reflect the comment discussion on math.stackexchange.com. The phrase "If at week t you have received a book you have read before x times" suggests to me that you have to have received the same book 'x' times, but apparently this is not the case. – Capitation 1/2, 2014 at 19:41

@Capitation Thanks.Added clarification. Does that make it clearer? – Noellenoellyn 2/2, 2014 at 13:32

Yes, the phrase in question is now much clearer. I revised the example for clarity as well. – Capitation 3/2, 2014 at 16:15

Looks like you're ok on the first part, so I'll tackle your second and third points.

There are plenty of ways to fit smooth curves, with scipy.interpolate and splines, or with scipy.optimize.curve_fit. Personally, I prefer curve_fit, because you can supply your own function and let it fit the parameters for you.

Alternatively, if you don't want to learn a parametric function, you could do simple rolling-window smoothing with numpy.convolve.

As for code quality: you're not taking advantage of numpy's speed, because you're doing things in pure python. I would write your (existing) code like this:

from __future__ import division
import numpy as np
import matplotlib.pyplot as plt

# N is the true number of books.
# t is the number of weeks.
# unk is the true number of repeats found 
t = 30
unk = 3
def numberrepeats(N, t, iters):
    rand = np.random.randint(0, N, size=(t, iters))
    return t - np.array([len(set(r)) for r in rand])

iters = 1000
ydata = np.empty(500-10)
for N in xrange(10,500):
    sampledunk = np.count_nonzero(numberrepeats(N,t,iters) == unk)
    ydata[N-10] = sampledunk/iters

print "MLE is", np.argmax(ydata)
xdata = range(10, 500)
print len(xdata), len(ydata)
plt.plot(xdata,ydata)
plt.show()

It's probably possible to optimize this even more, but this change brings your code's runtime from ~30 seconds to ~2 seconds on my machine.

Tensimeter answered 1/2, 2014 at 16:56 Comment(1)

Thanks. The first part is not answered though. The current answer is not right. I have a comment about this in the question. – Noellenoellyn 1/2, 2014 at 17:30

The a simple (numerical) way to get a confidence interval is simply to run your script many times, and see how much your estimate varies. You can use that standard deviation to calculate the confidence interval.

In the interest of time, another option is to run a bunch of trials at each value of N (I used 2000), and then use random subsampling of those trials to get an estimate of the estimator standard deviation. Basically, this involves selecting a subset of the trials, generating your likelihood curve using that subset, then finding the maximum of that curve to get your estimator. You do this over many subsets and this gives you a bunch of estimators, which you can use to find a confidence interval on your estimator. My full script is as follows:

import numpy as np

t = 30
k = 3
def trial(N):
    return t - len(np.unique(np.random.randint(0, N, size=t)))

def trials(N, n_trials):
    return np.asarray([trial(N) for i in xrange(n_trials)])

n_trials = 2000
Ns = np.arange(1, 501)
results = np.asarray([trials(N, n_trials=n_trials) for N in Ns])

def likelihood(results):
    L = (results == 3).mean(-1)

    # boxcar filtering
    n = 10
    L = np.convolve(L, np.ones(n) / float(n), mode='same')

    return L

def max_likelihood_estimate(Ns, results):
    i = np.argmax(likelihood(results))
    return Ns[i]

def max_likelihood(Ns, results):
    # calculate mean from all trials
    mean = max_likelihood_estimate(Ns, results)

    # randomly subsample results to estimate std
    n_samples = 100
    sample_frac = 0.25
    estimates = np.zeros(n_samples)
    for i in xrange(n_samples):
        mask = np.random.uniform(size=results.shape[1]) < sample_frac
        estimates[i] = max_likelihood_estimate(Ns, results[:,mask])

    std = estimates.std()
    sterr = std * np.sqrt(sample_frac) # is this mathematically sound?
    ci = (mean - 1.96*sterr, mean + 1.96*sterr)
    return mean, std, sterr, ci

mean, std, sterr, ci = max_likelihood(Ns, results)
print "Max likelihood estimate: ", mean
print "Max likelihood 95% ci: ", ci

There are two drawbacks to this method. One is that, since you're taking many subsamples from the same set of trials, your estimates are not independent. To limit the effect of this, I only used 25% of the results for each subset. Another drawback is that each subsample is only a fraction of your data, so estimates derived from these subsets will have more variance than estimates derived from running the full script many times. To account for this, I computed the standard error as the standard deviation divided by the square root of 4, since I had four times as much data in my full data set than in one of the subsamples. However, I'm not familiar enough with Monte Carlo theory to know if this is mathematically sound. Running my script a number of times did seem to indicate that my results were reasonable.

Lastly, I did use a boxcar filter on the likelihood curves to smooth them out a bit. Ideally, this should improve results, but even with the filtering there was still a considerable amount of variability in the results. When calculating the value for the overall estimator, I wasn't sure if it would be better compute one likelihood curve from all the results and use the max of that (this is what I ended up doing), or to use the mean of all the subset estimators. Using the mean of the subset estimators might be able to help cancel out some of the roughness in the curves that remains after filtering, but I'm not sure on this.

Capitation answered 1/2, 2014 at 19:37 Comment(0)

Here is an answer to your first question and a pointer to a solution for the second:

plot(xdata,ydata)
#  calculate the cumulative distribution function
cdf = np.cumsum(ydata)/sum(ydata)
# get the left and right boundary of the interval that contains 95% of the probability mass 
right=argmax(cdf>0.975)
left=argmax(cdf>0.025)
# indicate confidence interval with vertical lines
vlines(xdata[left], 0, ydata[left])
vlines(xdata[right], 0, ydata[right])
# hatch confidence interval
fill_between(xdata[left:right], ydata[left:right], facecolor='blue', alpha=0.5)

This produces the following figure: enter image description here

I'll try to answer question 3 when I have more time :)

Chancellorsville answered 30/1, 2014 at 11:4 Comment(3)

I got your code to run using np.argmax, np.sum and fixing the typo vlines(xdata[right], 0, ydata[right]) . – Noellenoellyn 30/1, 2014 at 11:11

Yes, I was running that in ipython, which automatically imports numpy functions. Sorry :) – Chancellorsville 30/1, 2014 at 11:16

Ah.. this the wrong formula for a confidence interval for a likelihood function. I believe what one is meant to do is take logs, find the maximum and then step down 2 on the left and right. I have added something to the question. Or, do it by simulation. – Noellenoellyn 30/1, 2014 at 12:4

Recommended topics

Hot tags