Generating predictions from inferred parameters in pymc3

I've run into a common problem that I'm hoping someone can help with. I often want to use pymc3 in two modes: training (i.e. actually running inference on parameters) and evaluation (i.e. using the inferred parameters to generate predictions).

In general, I'd like a posterior over predictions, not just point-wise estimates (that's part of the benefit of the Bayesian framework, no?). When your training data is fixed, this is typically accomplished by adding a simulated variable of the same form as the observed variable. For example,

from pymc3 import *

# X1, X2, and Y are assumed to be existing NumPy arrays of training data
basic_model = Model()

with basic_model:

    # Priors for unknown model parameters
    alpha = Normal('alpha', mu=0, sd=10)
    beta = Normal('beta', mu=0, sd=10, shape=2)
    sigma = HalfNormal('sigma', sd=1)

    # Expected value of outcome
    mu = alpha + beta[0]*X1 + beta[1]*X2

    # Likelihood (sampling distribution) of observations
    Y_obs = Normal('Y_obs', mu=mu, sd=sigma, observed=Y)
    Y_sim = Normal('Y_sim', mu=mu, sd=sigma, shape=len(X1))

    start = find_MAP()
    step = NUTS(scaling=start)
    trace = sample(2000, step, start=start)

But what if my data changes? Say I want to generate predictions based on new data, but without running inference all over again. Ideally, I'd have a function like predict_posterior(X1_new, X2_new, 'Y_sim', trace=trace) or even predict_point(X1_new, X2_new, 'Y_sim', vals=trace[-1]) that would simply run the new data through the theano computation graph.

I suppose part of my question relates to how pymc3 implements the theano computation graph. I've noticed that model.Y_sim.eval seems close to what I want, but it requires a value for Y_sim as an input and seems just to return whatever you give it.

I imagine this process is extremely common, but I can't seem to find any way to do it. Any help is greatly appreciated. (Note also that I have a hack to do this in pymc2; it's more difficult in pymc3 because of theano.)

Crevasse answered 21/10, 2015 at 0:43 Comment(4)
You are talking about sampling from the posterior predictive distribution, which you appear to be doing correctly. Not sure exactly what you mean by "based on new data", though. Are you talking about using the posteriors from this analysis as priors for inference based on additional data?Illinium
@ChrisFonnesbeck That is something I would also be interested in since the posteriors we get are in the form of a trace and we can't use those to specify the priors in the example's syntax.Smetana
twiecki on the pymc3 gitter page pointed me to this page, which seems to address the issue I'm facing. I'll have to spend some time to understand what was done, but it looks promising.Crevasse
With sufficient data, posteriors tend to be approximately (multivariate) normal. A simple approach would be to extract the mean and sd of the posterior, and use those to parameterize normal priors for the subsequent analysis.Illinium
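
A minimal sketch of that comment's suggestion, assuming trace comes from the question's model; X1_more, X2_more, and Y_more are hypothetical names for the additional data:

import numpy as np
from pymc3 import Model, Normal, HalfNormal, sample

# Summarize the first run's posterior with normal approximations
alpha_mu, alpha_sd = trace['alpha'].mean(), trace['alpha'].std()
beta_mu, beta_sd = trace['beta'].mean(axis=0), trace['beta'].std(axis=0)

with Model() as updated_model:
    # Priors parameterized by the previous posterior's moments
    alpha = Normal('alpha', mu=alpha_mu, sd=alpha_sd)
    beta = Normal('beta', mu=beta_mu, sd=beta_sd, shape=2)
    sigma = HalfNormal('sigma', sd=1)

    mu = alpha + beta[0]*X1_more + beta[1]*X2_more
    Y_obs = Normal('Y_obs', mu=mu, sd=sigma, observed=Y_more)
    trace_updated = sample(2000)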

Note: This functionality is now incorporated into the core code as the pymc3.sample_ppc method (since renamed to sample_posterior_predictive). Check out the docs for more info.

Based on this link (dead as of July 2017) sent to me by twiecki, there are a couple of tricks that solve my issue. The first is to put the training data into shared theano variables. This allows us to change the data later without screwing up the theano computation graph.

import theano

X1_shared = theano.shared(X1)
X2_shared = theano.shared(X2)

Next, build the model and run the inference as usual, but using the shared variables.

basic_model = Model()

with basic_model:

    # Priors for unknown model parameters
    alpha = Normal('alpha', mu=0, sd=10)
    beta = Normal('beta', mu=0, sd=10, shape=2)
    sigma = HalfNormal('sigma', sd=1)

    # Expected value of outcome
    mu = alpha + beta[0]*X1_shared + beta[1]*X2_shared

    # Likelihood (sampling distribution) of observations
    Y_obs = Normal('Y_obs', mu=mu, sd=sigma, observed=Y)

    start = find_MAP()
    step = NUTS(scaling=start)
    trace = sample(2000, step, start=start)

Finally, there's a function under development (it will likely get added to pymc3 eventually) that will let you generate posterior predictions for new data.

from collections import defaultdict

import numpy as np
import pymc3 as pm

def run_ppc(trace, samples=100, model=None):
    """Generate posterior predictive samples from a model given a trace.
    """
    if model is None:
        model = pm.modelcontext(model)

    ppc = defaultdict(list)
    # Draw random parameter sets from the trace and sample each observed
    # variable conditioned on them
    for idx in np.random.randint(0, len(trace), samples):
        param = trace[idx]
        for obs in model.observed_RVs:
            ppc[obs.name].append(obs.distribution.random(point=param))

    return ppc

Next, pass in the new data that you want to run predictions on:

X1_shared.set_value(X1_new)
X2_shared.set_value(X2_new)

Finally, you can generate posterior predictive samples for the new data.

ppc = run_ppc(trace, model=basic_model, samples=200)

The variable ppc is a dictionary with a key for each observed variable in the model. So, in this case, ppc['Y_obs'] would contain a list of arrays, each of which is generated using a single set of parameters from the trace.
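
For example, a quick way (a sketch, assuming ppc was built as above) to collapse those samples into point predictions with a 95% interval:

import numpy as np

# Stack the list of per-draw arrays into shape (samples, n_points)
y_samples = np.asarray(ppc['Y_obs'])

y_mean = y_samples.mean(axis=0)                              # point predictions
y_lo, y_hi = np.percentile(y_samples, [2.5, 97.5], axis=0)   # 95% interval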

Note that you can even modify the parameters extracted from the trace. For example, I had a model using a GaussianRandomWalk variable and I wanted to generate predictions into the future. While you could let pymc3 sample into the future (i.e. allow the random walk variable to diverge), I just wanted to use a fixed value of the coefficient corresponding to the last inferred value. This logic can be implemented in the run_ppc function, as sketched below.
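
A hedged sketch of that modification, replacing the loop body in run_ppc above (the variable name 'coeff' is hypothetical):

# Freeze a GaussianRandomWalk variable named 'coeff' at its last inferred
# value instead of letting it wander when predicting into the future
param = dict(trace[idx])          # copy so the trace itself isn't mutated
param['coeff'] = np.full_like(param['coeff'], param['coeff'][-1])
for obs in model.observed_RVs:
    ppc[obs.name].append(obs.distribution.random(point=param))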

It's also worth mentioning that the run_ppc function is extremely slow: it takes about as much time as running the actual inference. I suspect this has to do with some inefficiency in how theano is used.

EDIT: The link originally included seems to be dead.

Crevasse answered 22/10, 2015 at 21:16 Comment(0)

The answer above from @santon is correct. I am just adding to it.

You no longer need to write your own run_ppc method; pymc3 provides the sample_posterior_predictive method, which does the same thing.
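
For example, a minimal sketch using the built-in helper with the shared-variable model and trace from the accepted answer (X1_new and X2_new as above):

import pymc3 as pm

# Point the shared inputs at the new data, then draw from the
# posterior predictive inside the model context
X1_shared.set_value(X1_new)
X2_shared.set_value(X2_new)

with basic_model:
    ppc = pm.sample_posterior_predictive(trace, samples=200)

# ppc['Y_obs'] is an array of shape (samples, len(X1_new))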

Death answered 26/4, 2019 at 9:19 Comment(0)
