How to choose the correct arguments of statsmodels STL function?

I've been reading about time-series decomposition, and have a fairly good idea of how it works on simple examples, but am having trouble extending the concepts.

For example, some simple synthetic data I'm playing with:

[Figure: plot of the synthetic time series]

So there is no actual time associated with this data. It could be sampled every second or every year. Whatever the sampling frequency, the period is roughly 160 time steps, and using this as the period argument yields the expected results:

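import matplotlib.pyplot as plt
from statsmodels.tsa.seasonal import STL

# synth is the DataFrame holding the synthetic series in its "value" column
# seasonal=13 based on example in the statsmodels user guide
decomp = STL(synth.value, period=160, seasonal=13).fit()

# Plot the three components in stacked panels
fig, ax = plt.subplots(3, 1, figsize=(12, 6))
decomp.trend.plot(title='Trend', ax=ax[0])
decomp.seasonal.plot(title='Seasonal', ax=ax[1])
decomp.resid.plot(title='Residual', ax=ax[2])
plt.tight_layout()
plt.show()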

[Figure: trend, seasonal, and residual panels of the STL decomposition]

But looking at other datasets, it's not really that easy to see the period of the seasonality, so it leads me to a couple of questions:

How do you find the correct arguments for messy real-world data, particularly the period argument, but the others as well? Is it just a parameter search that you perform until the decomposition looks sane?

Parameters

endog : array_like Data to be decomposed. Must be squeezable to 1-d.

period : Periodicity of the sequence. If None and endog is a pandas Series or DataFrame, attempts to determine from endog. If endog is a ndarray, period must be provided.

seasonal : Length of the seasonal smoother. Must be an odd integer, and should normally be >= 7 (default).

trend : Length of the trend smoother. Must be an odd integer. If not provided uses the smallest odd integer greater than 1.5 * period / (1 - 1.5 / seasonal), following the suggestion in the original implementation.

Cohberg answered 5/2, 2021 at 17:2 Comment(2)
Same question here. If I only have 1 or 2 years of time series data at daily frequency, I really don't know how to set the period parameter... - Pachisi
The official guide has almost nothing about the parameters. - Pachisi

I had the same question. After tracing some of their codebase, I have found the following. This may help:

  • To infer the period automatically, statsmodels needs a pandas Series or DataFrame with a DatetimeIndex (a plain ndarray requires an explicit period).
  • This DatetimeIndex can have a frequency. You can either resample your data with pandas or explicitly set a frequency on your index. To check, inspect df.index and look for the freq attribute.
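As a rough illustration of the second point, the sketch below builds a small daily series with one missing day and then attaches a frequency with asfreq. The dates and values are made up for the example.

```python
import pandas as pd

# Hypothetical daily series with one missing day (2021-01-04)
dates = pd.DatetimeIndex(["2021-01-01", "2021-01-02", "2021-01-03", "2021-01-05"])
s = pd.Series([1.0, 2.0, 3.0, 5.0], index=dates)
print(s.index.freq)  # no frequency is attached to this index

# asfreq reindexes onto a regular daily grid and attaches the frequency;
# the missing day shows up as NaN and would need filling before decomposing
regular = s.asfreq("D")
print(regular.index.freq)
print(regular.isna().sum())
```

The gap has to be filled (for example by interpolation) before the series is passed to a decomposition, since the new row holds NaN.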

This leads to two situations:

Your index has frequency set

If you have set a frequency on your index, statsmodels will pick it up and use it to determine a period automatically. Internally it uses the freq_to_period function, defined in the tsatools submodule.

To summarise what this does: the period is the expected periodicity of your seasonal component, translated back to a year.

In other words: "how often your seasonal cycle will repeat itself in a year". For reference, read the note on the freq_to_period method definition: Annual maps to 1, quarterly maps to 4, monthly to 12, weekly to 52.
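The mapping can be checked directly. This is a small sketch that assumes freq_to_period is importable from statsmodels.tsa.tsatools and accepts pandas frequency aliases as strings; only aliases whose spelling is stable across pandas versions are used here.

```python
from statsmodels.tsa.tsatools import freq_to_period

# Ask statsmodels which period it would assume for a few frequency aliases
periods = {alias: freq_to_period(alias) for alias in ("W", "D", "B")}
for alias, period in periods.items():
    print(f"freq={alias!r} -> period={period}")
```

Monthly, quarterly, and annual aliases are left out on purpose because their spelling changed in pandas 2.2.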

This is done both for seasonal_decompose and for STL.

Your index has no frequency set

It gets a bit more complicated if your index does not have its freq attribute set. seasonal_decompose checks whether it can find an inferred_freq attribute on your index, and STL takes the same approach.

This inferred_freq is computed by the pandas function infer_freq, which is documented as "Infer the most likely frequency given the input index." Pandas exposes index.inferred_freq on any DatetimeIndex by default, provided it has at least 3 elements.
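A quick way to see the difference between freq and inferred_freq, as a minimal sketch with made-up dates:

```python
import pandas as pd

# An index built from explicit dates carries no freq, but the spacing is regular
idx = pd.DatetimeIndex(["2021-01-01", "2021-01-02", "2021-01-03", "2021-01-04"])
print(idx.freq)           # nothing was set explicitly
print(idx.inferred_freq)  # pandas infers the spacing from the timestamps

# With fewer than 3 timestamps there is not enough data to infer anything
short = idx[:2]
print(short.inferred_freq)
```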

TLDR: The period parameter should be set to the amount of times you expect the seasonal cycle to re-occur within a year. You can set this explicitly; otherwise statsmodels will infer it from the freq attribute of your DatetimeIndex. If the freq attribute is None, it falls back on pandas' index.inferred_freq attribute to determine the frequency, and then converts that to a preset periodicity.

Faradize answered 4/3, 2022 at 11:29 Comment(5)
My input to endog is a pandas.Series of continuous values with a DatetimeIndex (Timestamp entries in %Y-%m-%d format) at yearly frequency. When I passed that as input, I got the error ValueError: period must be a positive integer >= 2. Setting period to 1 still gave me that error. Since your answer mentions that "yearly maps to 1", is setting period to 2 correct? - Gadolinium
I would indeed expect your code to run correctly with the period set to 1. However, I see that the ValueError is raised in STL itself. Does the input data already have a frequency set on its index? You could try setting the frequency of your index to a yearly value (.asfreq('Y'); note that this used to be 'A' prior to pandas 2.2.0). I'm not sure if that will work. - Faradize
Also see this discussion. - Faradize
I don't think the statement: "The period parameter should be set to the amount of times you expect the seasonal cycle to re-occur within a year." is right. The source inferred_freq shows that for freq=A, Q, M, and W, it expects each row to be a year, quarter, etc. apart. It then assumes a yearly cycle for those cases, with a period length of 1, 4, 12, and 52, respectively. This is in line with your statement. For freq=D, B, and H, it assumes weekly (D and B) and daily (H) cycles, and returns a period length of 7, 5, and 24. This disagrees with your statement. - Shilashilha
For completeness, based on the source of inferred_freq, I understand that you should set period to the number of rows (if there were no gaps) after which you expect your data to repeat a cycle. OP expects a new cycle every 160 steps, which is why setting period=160 works. - Shilashilha
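When the cycle length in rows is not obvious by eye, one possible starting point (not something statsmodels does for you) is to read it off the strongest peak of the FFT power spectrum. The sketch below uses only NumPy; estimate_period is a hypothetical helper, and the synthetic series is made up to mimic the 160-step cycle from the question.

```python
import numpy as np

def estimate_period(values):
    """Return the dominant cycle length, in rows, from the FFT power peak."""
    x = np.asarray(values, dtype=float)
    x = x - x.mean()                     # remove the constant offset
    power = np.abs(np.fft.rfft(x)) ** 2  # power at each frequency bin
    freqs = np.fft.rfftfreq(x.size)      # frequencies in cycles per row
    peak = np.argmax(power[1:]) + 1      # skip the zero-frequency bin
    return int(round(1.0 / freqs[peak]))

# Synthetic example: a noisy sine wave that repeats every 160 rows
rng = np.random.default_rng(0)
t = np.arange(1600)
synthetic = np.sin(2 * np.pi * t / 160) + 0.3 * rng.standard_normal(t.size)

period = estimate_period(synthetic)
print(period)
```

The result can then be passed on as STL(series, period=period). A strong trend dominates the low-frequency bins, so real data usually needs detrending first, and the estimate should still be checked against the decomposition plots.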
