Simulating Time Series With Unobserved Components Model

After fitting a local level model using UnobservedComponents from statsmodels, we are trying to find a way to simulate new time series from the results. Something like:

import numpy as np
import matplotlib.pyplot as plt
import statsmodels.api as sm
from statsmodels.tsa.statespace.structural import UnobservedComponents

np.random.seed(12345)
ar = np.r_[1, 0.9]
ma = np.array([1])
arma_process = sm.tsa.arima_process.ArmaProcess(ar, ma)

# Exogenous series X and response y, with an intervention of +10 from index 70 onwards
X = 100 + arma_process.generate_sample(nsample=100)
y = 1.2 * X + np.random.normal(size=100)
y[70:] += 10

plt.plot(X, label='X')
plt.plot(y, label='y')
plt.axvline(69, linestyle='--', color='k')
plt.legend();

[figure: X and y over time, with the intervention point at index 70 marked]

# Fit the local level model ('llevel') on the pre-intervention period only
ss = {}
ss["endog"] = y[:70]
ss["level"] = "llevel"
ss["exog"] = X[:70]

model = UnobservedComponents(**ss)
trained_model = model.fit()
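
For reference, the estimated variances and the exog coefficient can be inspected directly from the results object; a minimal sketch:

print(trained_model.summary())
print(trained_model.params)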

Is it possible to use trained_model to simulate new time series given the exogenous variable X[70:]? Just as we have the arma_process.generate_sample(nsample=100), we were wondering if we could do something like:

trained_model.generate_random_series(nsample=100, exog=X[70:])

The motivation behind it is to compute the probability of observing a time series as extreme as the observed y[70:] (a p-value for the hypothesis that the observed response is larger than the predicted one).

[EDIT]

After reading Josef's and cfulton's comments, I tried implementing the following:

# X_post, y_post are the post-intervention slices; f_model is the fitted results object
X_post, y_post = X[70:], y[70:]
mod1 = UnobservedComponents(np.zeros(len(y_post)), 'llevel', exog=X_post)
mod1.simulate(f_model.params, len(X_post))

But this resulted in simulations that don't seem to track the predicted_mean of the forecast that uses X_post as exog. Here's an example:

[figure: a default simulation drifting far below y_post and the forecast]

While y_post meanders around 100, the simulation is at around -400. This approach always leads to a p-value of 50%.
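
To make the comparison concrete, the post-period forecast from f_model can be plotted next to one of these simulations (a sketch using get_forecast, continuing from the variables above):

forecast = f_model.get_forecast(steps=len(X_post), exog=X_post)
sim = mod1.simulate(f_model.params, len(X_post))

plt.plot(forecast.predicted_mean, label='forecast mean')
plt.plot(sim, label='simulation')
plt.plot(y_post, label='y_post')
plt.legend();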

So when I tried using initial_state=0 and random state shocks, here's the result:

[figure: simulations with initial_state=0 and random state shocks]

It now seemed that the simulations were following the predicted mean and its 95% credible interval (as cfulton commented below, this is actually a wrong approach as well, since it replaces the level variance of the trained model).

I tried using this approach just to see what p-values I'd observe. Here's how I compute the p-value:

samples = 1000
r = 0
y_post_sum = y_post.sum()
for _ in range(samples):
    sim = mod1.simulate(f_model.params, len(X_post), initial_state=0,
                        state_shocks=np.random.normal(size=len(X_post)))
    # Count simulations at least as extreme as the observed post-period sum
    r += sim.sum() >= y_post_sum
print(r / samples)

For context, this follows the CausalImpact model developed by Google. As it's implemented in R, we've been trying to replicate it in Python, using statsmodels as the core for processing the time series.

We already have a working WIP implementation, but we still need the p-value to know when an observed impact is not explained by mere randomness (the approach of simulating series and counting the ones whose sum exceeds y_post.sum() is also what Google's model implements).

In my example I used y[70:] += 10. If I add just one instead of ten, Google's p-value computation returns 0.001 (there's an impact in y), whereas the Python approach returns 0.247 (no impact).

Only when I add +5 to y_post does the model return a p-value of 0.02; as it's lower than 0.05, we consider that there's an impact in y_post.

I'm using Python 3 and statsmodels version 0.9.0.

[EDIT2]

After reading cfulton's comments I decided to fully debug the code to see what was happening. Here's what I found:

When we create an object of type UnobservedComponents, eventually the Kalman filter representation is initialized. By default, it receives the parameter initial_variance as 1e6, which sets the property of the same name on the object.

When we run the simulate method, the initial_state_cov value is created using this same value:

initial_state_cov = (
    np.eye(self.k_states, dtype=self.ssm.transition.dtype) *
    self.ssm.initial_variance
)

Finally, this same value is used to find initial_state:

initial_state = np.random.multivariate_normal(
    self._initial_state, self._initial_state_cov)

Which results in a normal distribution with a variance of 1e6 (i.e. a standard deviation of 1000) around a zero mean.
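
This default can be checked directly on the state space representation (a quick check using the model object built earlier; ssm and initial_variance are the attributes referenced in the snippets above):

print(model.ssm.initial_variance)           # 1000000.0 by default
print(np.sqrt(model.ssm.initial_variance))  # implied standard deviation of 1000 for the initial state draw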

I then tried running the following:

mod1 = UnobservedComponents(np.zeros(len(X_post)), level='llevel', exog=X_post, initial_variance=1)
sim = mod1.simulate(f_model.params, len(X_post))
plt.plot(sim, label='simul')
plt.plot(y_post, label='y')
plt.legend();
print(sim.sum() > y_post.sum())

Which resulted in:

[figure: simulation with initial_variance=1 plotted against y_post]

I then tested the p-value, and for a variation of +1 in y_post the model now correctly identifies the added signal.

Even so, when I tested with the same data that we use with R's Google package, the p-value was still off. Maybe it's a matter of further tweaking the input to increase its accuracy.

Bidding asked 16/8, 2018 at 16:0. Comments (9):
Prohibit: maybe (I never tried them) the examples in the unit test suite help: github.com/statsmodels/statsmodels/blob/master/statsmodels/tsa/… statsmodels.org/devel/generated/…
Bidding: @Prohibit thanks a lot for those links! Will give it a try! I saw the simulate method before, but interpreted it as if it only worked for the model object and not the trained one. I could also take the parameters from the trained model and build a new one (a bit indirect, but I hope it works).
Bidding: Hi @Josef, I followed your recommendation and it seems to be working. I still have some questions, I wonder if you can help me. I edited my question with updated information and, as you can see, I had to set initial_state to zero and state_shocks to a normal. Do you know why this is necessary or what it means? I followed the unit tests and it seems to be working, but I don't quite understand why.
Overrate: The fact that the simulation is around -400 is concerning, and I can't replicate that. What version of Statsmodels are you using?
Bidding: @Overrate I'm using 0.9.0, running a notebook in the jupyter/datascience-notebook:latest docker image. I tried restarting everything here but the result is still quite far off the expected (sometimes it's almost -1000).
Overrate: Is it possible that you could post a GitHub gist or something of a Jupyter notebook with all your code?
Bidding: @Overrate sure! Here it goes: gist.github.com/WillianFuks/0c840d31e9fe8f707294f807d42b0b43
Bidding: Hi @cfulton, I just added a new "EDIT2" to my question to discuss the initial values I'm observing in the model construction. It looks like initial_variance receives a default value big enough to considerably deviate the initial_state of the simulation. I'm not sure, though, whether what I did is 100% correct.
Overrate: @WillianFuks thanks for the additional edit, I edited my answer to respond (and sorry about the delay).
Answer by Overrate:

@Josef is correct and you did the right thing with:

mod1 = UnobservedComponents(np.zeros(len(y_post)), 'llevel', exog=X_post)
mod1.simulate(f_model.params, len(X_post))

The simulate method simulates data according to the model in question, which is why you can't directly use trained_model to simulate when you have exogenous variables.

You wrote:

But for some reason the simulations always ended up being lower than y_post.

I think this should be expected - running your example and looking at the estimated coefficients, we get:

                       coef    std err          z      P>|z|      [0.025      0.975]
------------------------------------------------------------------------------------
sigma2.irregular     0.9278      0.194      4.794      0.000       0.548       1.307
sigma2.level         0.0021      0.008      0.270      0.787      -0.013       0.018
beta.x1              1.1882      0.058     20.347      0.000       1.074       1.303

The variance of the level is very small, which means that it is extremely unlikely that the level would shift upwards by nearly 10 percent in a single period, based on the model you specified.
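
To put a rough number on it: sigma2.level ≈ 0.0021 implies a one-period level shock standard deviation of about sqrt(0.0021) ≈ 0.046, so a sudden level shift of 10 would be on the order of 200 standard deviations under the fitted model, which is why the simulations essentially never reproduce it.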

When you used:

mod1.simulate(f_model.params, len(X_post), initial_state=0,
              state_shocks=np.random.normal(size=len(X_post)))

what happened is that the level term is the only unobserved state here, and by providing your own shocks with a variance equal to 1, you essentially overrode the level variance actually estimated by the model. I don't think that setting the initial state to 0 has much of an effect here (see edit).
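
For comparison, if the intent were only to supply the shocks explicitly while keeping the model's own level variance, they would need to be scaled by the estimated sigma2.level rather than drawn with unit variance. A sketch (assuming the parameter order shown in the coefficient table above, so that f_model.params[1] is sigma2.level when endog is a plain numpy array):

level_std = np.sqrt(f_model.params[1])  # standard deviation of the estimated level disturbance
sim = mod1.simulate(f_model.params, len(X_post), initial_state=0,
                    state_shocks=np.random.normal(scale=level_std, size=len(X_post)))

Equivalently, omitting state_shocks altogether lets simulate draw them from the estimated variance itself; the remaining issue is the initial state, which the edit below addresses.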

You write:

the p-value computation was closer, but still is not correct.

I'm not sure what this means - why would you expect the model to think such a jump was a likely occurrence? What p-value are you expecting to achieve?

Edit:

Thanks for investigating further (in Edit 2). First, what I think you should do is:

mod1 = UnobservedComponents(np.zeros(len(y_post)), 'llevel', exog=X_post)
initial_state = np.random.multivariate_normal(
    f_model.predicted_state[..., -1], f_model.predicted_state_cov[..., -1])
mod1.simulate(f_model.params, len(X_post), initial_state=initial_state)

Now, the explanation:

In Statsmodels 0.9, we didn't yet have exact treatment of states with a diffuse initialization (it has been merged in since then, though, and this is one reason that I wasn't able to replicate your results until I tested your example with the 0.9 codebase). These "initially diffuse" states don't have a long-run mean that we can solve for (e.g. a random walk process), and the state in the local level case is such a state.

The "approximate" diffuse initialization involves setting the initial state mean to zero and the initial state variance to a large number (as you discovered).

For simulations, the initial state is, by default, sampled from the given initial state distribution. Since this model is initialized with approximate diffuse initialization, that explains why your process was initialized around some random number.

Your solution is a good patch, but it's not optimal because it doesn't base the initial state for the simulated period on the last state from the estimated model / data. These values are given by f_model.predicted_state[..., -1] and f_model.predicted_state_cov[..., -1].
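
Putting this together with the p-value loop from the question, each replication could then be started from the last predicted state of the trained model (a sketch combining the snippets above; variable names as in the question):

mod1 = UnobservedComponents(np.zeros(len(y_post)), 'llevel', exog=X_post)
samples = 1000
r = 0
y_post_sum = y_post.sum()
for _ in range(samples):
    # Draw the starting state from the trained model's final predicted state distribution
    initial_state = np.random.multivariate_normal(
        f_model.predicted_state[..., -1], f_model.predicted_state_cov[..., -1])
    sim = mod1.simulate(f_model.params, len(X_post), initial_state=initial_state)
    r += sim.sum() >= y_post_sum
print(r / samples)

Redrawing initial_state inside the loop (rather than once) also propagates the uncertainty about the final state into the simulated paths.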

Overrate answered 17/8, 2018 at 23:44. Comments (3):
Bidding: Hi cfulton, thanks for the help (and also for developing statsmodels; I've been studying the code and reading the docs, and it's clear how complex and valuable it is for the Python community). I've updated my question with (hopefully) better information now. I still can't get proper p-values when there is in fact signal in y_post. Maybe there's still something wrong in how I'm running the simulations. If you need more info, please let me know.
Bidding: That's amazing! Thank you very much, Fulton, for the help, now it's working exactly like the R model! We can finally fully port the algorithm to Python thanks to your support and to statsmodels :)!
Overrate: Thanks for following up! This is an interesting use case, and I'm glad we got it solved.
