decompose() for time series: ValueError: You must specify a period or x must be a pandas object with a DatetimeIndex with a freq not set to None
B

9

34

I am having trouble running an additive decomposition. I have the following data frame:

[image 1: screenshot of the data frame]

And when I run this code:

import matplotlib
import statsmodels.api as sm

# Set the figure size before plotting so that it takes effect
matplotlib.rcParams['figure.figsize'] = [9.0, 5.0]
decomposition = sm.tsa.seasonal_decompose(df, model='additive')
fig = decomposition.plot()

I got that message:

ValueError: You must specify a period or x must be a pandas object with a DatetimeIndex with a freq not set to None

What should I do to get a result like this example: [image 2]

I took the screenshot above from this place.

Bridgeboard answered 1/2, 2020 at 12:53 Comment(0)
A
48

I had the same ValueError. What follows is the result of my own testing and a little research; I do not claim it is complete or professional. Please comment or answer if you find something wrong.

Of course, your data should be ordered by index, which you can ensure with df.sort_index(inplace=True), as you state in your answer. That is not wrong as such, but the error message is not about sort order. I checked this: in my case, the error does not go away when I sort the index of a large dataset I have at hand. It is true that I also have to sort df.index, but decompose() can handle unsorted data where items jump back and forth in time: you simply get many blue lines running from left to right and back until the whole graph is full of them. What is more, the data is usually already in the right order anyway. Since sorting did not fix the error in my case, I doubt that it fixed the error in yours, because: what does the error actually say?

ValueError: You must specify:

  1. [either] a period
  2. or x must be a pandas object with a DatetimeIndex with a freq not set to None
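A quick way to see which of the two conditions applies is to look at the index itself. The following is only a minimal sketch with a hypothetical monthly series (the names and values are made up for illustration): it shows that an index built from a list of timestamps carries no freq, that pandas can still infer a regular pattern, and that asfreq() attaches the frequency.

```python
import pandas as pd

# Hypothetical monthly series standing in for the asker's data.
idx = pd.to_datetime(["2020-01-01", "2020-02-01", "2020-03-01", "2020-04-01"])
s = pd.Series([10.0, 12.0, 9.0, 11.0], index=idx)

# An index built from a list of timestamps carries no freq, even if it is regular.
freq_before = s.index.freq

# pandas can still tell whether the timestamps follow a regular pattern.
inferred = pd.infer_freq(s.index)

# asfreq() attaches the frequency to the index (condition 2 of the message).
freq_after = s.asfreq("MS").index.freqstr

print(freq_before, inferred, freq_after)
```

If freq_before is None and you do not pass a period, you are in the situation the error message describes.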

First of all, if you have a list column, so that your time series is still nested, see Convert pandas df with data in a "list column" into a time series in long format. Use three columns: [list of data] + [timestamp] + [duration] for details on how to unnest a list column. This is needed for both 1. and 2.

Details of 1.: "You must specify [either] a period ..."

Definition of period

"period, int, optional" from https://www.statsmodels.org/stable/generated/statsmodels.tsa.seasonal.seasonal_decompose.html:

Period of the series. Must be used if x is not a pandas object or if the index of x does not have a frequency. Overrides default periodicity of x if x is a pandas object with a timeseries index.

The period parameter is an integer giving the number of observations in one cycle that you expect in the data. If you have a df with 1000 rows and a list column in it (call it df_nested), and each list has, for example, 100 elements, then you have 100 elements per cycle, so period = 100 is the natural starting point for splitting seasonality and trend. If the number of elements per cycle varies over time, other values may be better. I am not sure how to set the parameter correctly, hence the question statsmodels seasonal_decompose(): What is the right “period of the series” in the context of a list column (constant vs. varying number of items) on Cross Validated, which is not yet answered.

The "period" parameter of option 1. has a big advantage over option 2. Although the time index (DatetimeIndex) is still used for the x-axis, option 1. does not require each item to fall exactly on the frequency grid, in contrast to option 2. Instead, it just joins the rows in the order in which they appear, so you do not need to fill any gaps: the last value of one event is simply followed by the first value of the next event, whether that is in the next second or on the next day.

What is the maximum possible "period" value? If you have a list column (call the df "df_nested" again), first unnest the list column into a normal column. The maximum period is then len(df_unnested)/2.

Example 1: With 20 items in x (all items of df_unnested), the maximum is period = 10.

Example 2: With the same 20 items and period=20 instead, you get the following error:

ValueError: x must have 2 complete cycles requires 40 observations. x only has 20 observation(s)

Another side note: period = 1 is already enough to make the error in question go away, but for time series analysis it reveals nothing new: every cycle is then just 1 item, the trend equals the original data, the seasonality is 0, and the residuals are always 0.

####

Example borrowed from Convert pandas df with data in a "list column" into a time series in long format. Use three columns: [list of data] + [timestamp] + [duration]

import pandas as pd

df_test = pd.DataFrame({'timestamp': [1462352000000000000, 1462352100000000000, 1462352200000000000, 1462352300000000000],
                'listData': [[1,2,1,9], [2,2,3,0], [1,3,3,0], [1,1,3,9]],
                'duration_sec': [3.0, 3.0, 3.0, 3.0]})
tdi = pd.DatetimeIndex(df_test.timestamp)
df_test.set_index(tdi, inplace=True)
df_test.drop(columns='timestamp', inplace=True)
df_test.index.name = 'datetimeindex'

df_test = df_test.explode('listData') 
sizes = df_test.groupby(level=0)['listData'].transform('size').sub(1)
duration = df_test['duration_sec'].div(sizes)
df_test.index += pd.to_timedelta(df_test.groupby(level=0).cumcount() * duration, unit='s') 

The resulting df_test['listData'] looks as follows:

2016-05-04 08:53:20    1
2016-05-04 08:53:21    2
2016-05-04 08:53:22    1
2016-05-04 08:53:23    9
2016-05-04 08:55:00    2
2016-05-04 08:55:01    2
2016-05-04 08:55:02    3
2016-05-04 08:55:03    0
2016-05-04 08:56:40    1
2016-05-04 08:56:41    3
2016-05-04 08:56:42    3
2016-05-04 08:56:43    0
2016-05-04 08:58:20    1
2016-05-04 08:58:21    1
2016-05-04 08:58:22    3
2016-05-04 08:58:23    9

Now have a look at different period's integer values.

period = 1:

import matplotlib.pyplot as plt
from statsmodels.tsa.seasonal import seasonal_decompose

result_add = seasonal_decompose(x=df_test['listData'], model='additive', extrapolate_trend='freq', period=1)
plt.rcParams.update({'figure.figsize': (5,5)})
result_add.plot().suptitle('Additive Decompose', fontsize=22)
plt.show()

[image: decomposition plot for period = 1]

period = 2:

result_add = seasonal_decompose(x=df_test['listData'], model='additive', extrapolate_trend='freq', period=2)
plt.rcParams.update({'figure.figsize': (5,5)})
result_add.plot().suptitle('Additive Decompose', fontsize=22)
plt.show()

[image: decomposition plot for period = 2]

If you take a quarter of all items as one cycle, the period is 4 (out of 16 items) here.

period = 4:

result_add = seasonal_decompose(x=df_test['listData'], model='additive', extrapolate_trend='freq', period=int(len(df_test)/4))
plt.rcParams.update({'figure.figsize': (5,5)})
result_add.plot().suptitle('Additive Decompose', fontsize=22)
plt.show()

[image: decomposition plot for period = 4]

Or, if you take the maximum possible size of a cycle, the period is 8 (out of 16 items) here.

period = 8:

result_add = seasonal_decompose(x=df_test['listData'], model='additive', extrapolate_trend='freq', period=int(len(df_test)/2))
plt.rcParams.update({'figure.figsize': (5,5)})
result_add.plot().suptitle('Additive Decompose', fontsize=22)
plt.show()

[image: decomposition plot for period = 8]

Have a look at how the y-axes change their scale.

####

Increase the period integer according to your needs. The maximum for the data in your question would be:

sm.tsa.seasonal_decompose(df, model = 'additive', period = int(len(df)/2))

Details of 2.: "... or x must be a pandas object with a DatetimeIndex with a freq not set to None"

To give x a DatetimeIndex whose freq is not None, set the frequency of the DatetimeIndex with .asfreq('?'), where ? is your choice among the wide range of offset aliases listed at https://pandas.pydata.org/pandas-docs/stable/user_guide/timeseries.html#offset-aliases.

In your case, option 2. is better suited, as you seem to have a series without gaps. Your monthly data should then probably be given the "month start frequency", which has the offset alias "MS":

sm.tsa.seasonal_decompose(df.asfreq('MS'), model = 'additive')

See How to set frequency with pd.to_datetime()? for more details, also about how you would deal with gaps.

If you have data that is highly scattered in time so that you have too many gaps to fill or if gaps in time are nothing important, option 1 of using "period" is probably the better choice.

In my example case of df_test, option 2. is not good. The data is totally scattered in time, and if I take a second as the frequency, you get this:

Output of df_test.asfreq('s') (=frequency in seconds):

2016-05-04 08:53:20      1
2016-05-04 08:53:21      2
2016-05-04 08:53:22      1
2016-05-04 08:53:23      9
2016-05-04 08:53:24    NaN
                      ...
2016-05-04 08:58:19    NaN
2016-05-04 08:58:20      1
2016-05-04 08:58:21      1
2016-05-04 08:58:22      3
2016-05-04 08:58:23      9
Freq: S, Name: listData, Length: 304, dtype: object

You see here that although my data has only 16 rows, introducing a frequency in seconds forces the df to have 304 rows just to span "08:53:20" to "08:58:23", which creates 288 gaps. What is more, you have to hit the exact time. If your real frequency is 0.1 or even 0.12314 seconds instead, most of the items will not fall on the index.

Here is an example with min as the offset alias, df_test.asfreq('min'):

2016-05-04 08:53:20      1
2016-05-04 08:54:20    NaN
2016-05-04 08:55:20    NaN
2016-05-04 08:56:20    NaN
2016-05-04 08:57:20    NaN
2016-05-04 08:58:20      1

We see that only the first and the last minute are filled at all, the rest is not hit.

Taking the day as the offset alias, df_test.asfreq('d'):

2016-05-04 08:53:20    1

We see that you get only the first row as the resulting df, since there is only one day covered. It will give you the first item found, the rest is dropped.
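The row counts above can be reproduced without the list-column preparation. The sketch below rebuilds the same 16 timestamps by hand (a shortcut of mine, not the original df_test code) and counts what asfreq() does to them.

```python
import pandas as pd

# The same 16 timestamps as in the df_test output, rebuilt by hand.
values = [1, 2, 1, 9, 2, 2, 3, 0, 1, 3, 3, 0, 1, 1, 3, 9]
starts = pd.to_datetime(["2016-05-04 08:53:20", "2016-05-04 08:55:00",
                         "2016-05-04 08:56:40", "2016-05-04 08:58:20"])
index = pd.DatetimeIndex([t + pd.Timedelta(seconds=k) for t in starts for k in range(4)])
s = pd.Series(values, index=index, dtype="float64")

# One row per second between the first and the last timestamp.
per_second = s.asfreq("s")
n_rows = len(per_second)               # 304
n_gaps = int(per_second.isna().sum())  # 288

# One row per minute, counted from the first timestamp.
per_minute = s.asfreq("min")
n_minute_rows = len(per_minute)                # 6
n_minute_hits = int(per_minute.notna().sum())  # 2

print(n_rows, n_gaps, n_minute_rows, n_minute_hits)
```

Before choosing option 2., it is worth running such a count on your own data to see how many NaN rows the chosen frequency would introduce.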

The end of it all

Putting together all of this, in your case, take option 2., while in my example case of df_test, option 1 is needed.

Arteriole answered 4/8, 2020 at 17:50 Comment(0)
I
11

The reason might be that you have gaps in your data. For example:

[image: daily data with missing dates]

This data has gaps, so the seasonal_decompose() method raises an exception.

[image: daily data with no missing dates]

This data is fine: all days are covered, so no exception is raised.
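To find out whether your own data has such gaps, you can compare the index with the full range of expected dates. A minimal sketch, assuming daily data and using a hypothetical series with two days missing:

```python
import pandas as pd

# Hypothetical daily series with two days missing (Jan 4 and Jan 7).
days = pd.to_datetime(["2021-01-01", "2021-01-02", "2021-01-03",
                       "2021-01-05", "2021-01-06", "2021-01-08"])
daily = pd.Series(range(6), index=days, dtype="float64")

# Every day that should be there between the first and the last date.
expected = pd.date_range(daily.index.min(), daily.index.max(), freq="D")

# The dates that are expected but absent from the data.
missing = expected.difference(daily.index)
missing_days = list(missing.strftime("%Y-%m-%d"))

# Equivalent count: rows that asfreq() would have to insert as NaN.
n_inserted = int(daily.asfreq("D").isna().sum())

print(missing_days, n_inserted)
```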

Incapable answered 4/6, 2021 at 13:42 Comment(0)
C
6

I had the same issue and solved it by passing the period explicitly as an integer.

So, I used the following for my particular case.

decompose_result = seasonal_decompose(df.Sales, model='multiplicative', period=1)
decompose_result.plot();

where df.Sales is a Pandas series with a step-size of one between any two elements.

PS. You can find details of seasonal_decompose() by typing seasonal_decompose? in IPython or Jupyter. You will get output like the following, with the details of each parameter.

**Signature:**
seasonal_decompose(
    x,
    model='additive',
    filt=None,
    period=None,
    two_sided=True,
    extrapolate_trend=0,
)
Croup answered 24/11, 2021 at 15:32 Comment(0)
S
5

I've had the same issue and it eventually turned out (in my case at least) to be an issue of missing data points in my dataset. For example, I had hourly data for a certain period of time and there were 2 separate hourly data points missing (in the middle of the dataset), so I got the same error. When testing on a different dataset with no missing data points, it worked without any error messages. Hope this helps, although it's not exactly a solution.
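One possible way to repair a case like this is to put the series back on a regular hourly grid and fill the few missing points. The sketch below uses hypothetical data; time-based interpolation is only one choice of fill rule, and whether it is appropriate depends on your data.

```python
import pandas as pd

# Hypothetical hourly series with two hours missing in the middle.
hours = pd.date_range("2020-12-01 00:00", periods=12, freq="h")
hourly = pd.Series(range(12), index=hours, dtype="float64")
hourly = hourly.drop([hours[4], hours[8]])

# Back onto a regular hourly grid: the missing hours reappear as NaN.
regular = hourly.asfreq("h")
n_missing = int(regular.isna().sum())

# Fill the holes; interpolation is an assumption, ffill() would be another option.
filled = regular.interpolate(method="time")

print(n_missing, filled.index.freqstr, int(filled.isna().sum()))
```

The filled series has an index with a freq and no NaN values, so it satisfies the second condition of the error message.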

Socialization answered 14/12, 2020 at 19:21 Comment(0)
C
4

I got the same error because some dates were missing. A quick fix is to add those dates with default values.

Be cautious when choosing default values:

  • If your model is additive, the default can be 0
  • If your model is multiplicative, the series can't contain 0, so you can use 1

Code:-

from datetime import timedelta
import pandas as pd

# Start and end dates (the end date itself is excluded)
start_date = pd.to_datetime("2019-06-01")
end_date = pd.to_datetime("2021-08-20") - timedelta(days=1)

# All dates in the range, one per day
all_date = pd.date_range(start_date, end_date, freq='D')

# Left join the main data (df, with 'date' and 'session_count' columns) onto the full date list
all_date_df = pd.DataFrame({'date': all_date})
tdf = df.groupby('date', as_index=False)['session_count'].sum()
tdf = pd.merge(all_date_df, tdf, on='date', how="left")

# Fill the dates that were missing with the default value (0 for an additive model)
tdf.fillna(0, inplace=True)
Coprolalia answered 23/8, 2021 at 9:18 Comment(0)
G
3

I assume that you forgot to define a period and pass it to the freq argument of seasonal_decompose(). That is why it threw the following ValueError:

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-2-9b030cf1055e> in <module>()
----> 1 decomposition = sm.tsa.seasonal_decompose(df, model = 'additive')
      2 decompose_result.plot()

/usr/local/lib/python3.7/dist-packages/statsmodels/tsa/seasonal.py in seasonal_decompose(x, model, filt, freq, two_sided, extrapolate_trend)
    125             freq = pfreq
    126         else:
--> 127             raise ValueError("You must specify a freq or x must be a "
    128                              "pandas object with a timeseries index with "
    129                              "a freq not set to None")

ValueError: You must specify a freq or x must be a pandas object with a time-series index with a freq not set to None

Note: In older statsmodels releases, such as the one in the traceback above, this argument is called freq and there is no period argument (newer releases renamed freq to period). If you pass period to seasonal_decompose() on such an older release, you will get the following TypeError:

TypeError: seasonal_decompose() got an unexpected keyword argument 'period'

So please follow the scripts below:

# import libraries
import matplotlib.pyplot as plt
import numpy as np
from statsmodels.tsa.seasonal import seasonal_decompose
 
# Generate time-series data
total_duration = 100
step = 0.01
time = np.arange(0, total_duration, step)
 
# Period of the sinusoidal signal in seconds
T= 15
 
# Period component
series_periodic = np.sin((2*np.pi/T)*time)
 
# Add a trend component
k0 = 2
k1 = 2
k2 = 0.05
k3 = 0.001
 
series_periodic = k0*series_periodic
series_trend    = k1*np.ones(len(time))+k2*time+k3*time**2
series          = series_periodic+series_trend 

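# Pass the cycle length (in samples) to seasonal_decompose().
# Newer statsmodels releases name this argument `period`; on older
# releases that only accept `freq`, pass freq=period instead.
period = int(T/step)
results = seasonal_decompose(series, model='additive', period=period)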

trend_estimate    = results.trend
periodic_estimate = results.seasonal
residual          = results.resid
 
# Plot the time-series components
plt.figure(figsize=(14,10))
plt.subplot(221)
plt.plot(series,label='Original time series', color='blue')
plt.plot(trend_estimate ,label='Trend of time series' , color='red')
plt.legend(loc='best',fontsize=20 , bbox_to_anchor=(0.90, -0.05))
plt.subplot(222)
plt.plot(trend_estimate,label='Trend of time series',color='blue')
plt.legend(loc='best',fontsize=20, bbox_to_anchor=(0.90, -0.05))
plt.subplot(223)
plt.plot(periodic_estimate,label='Seasonality of time series',color='blue')
plt.legend(loc='best',fontsize=20, bbox_to_anchor=(0.90, -0.05))
plt.subplot(224)
plt.plot(residual,label='Decomposition residuals of time series',color='blue')
plt.legend(loc='best',fontsize=20, bbox_to_anchor=(1.09, -0.05))
plt.tight_layout()
plt.savefig('decomposition.png')

Plot of the time-series components: [image]

If you are working with a DataFrame:

# import libraries
import numpy as np
import pandas as pd
from statsmodels.tsa.seasonal import seasonal_decompose

# Generate some data
np.random.seed(0)
n = 1500
dates = np.array('2020-01-01', dtype=np.datetime64) + np.arange(n)
data = 12*np.sin(2*np.pi*np.arange(n)/365) + np.random.normal(12, 2, 1500)

#=================> Approach#1 <==================
# Pass the period explicitly after building the dataframe
df = pd.DataFrame({'data': data}, index=dates)

# Reproduce the OP's example
# (use freq=15 instead on older statsmodels releases without `period`)
seasonal_decompose(df['data'], model='additive', period=15).plot()

#=================> Approach#2 <==================
# Give the DatetimeIndex a frequency with asfreq() so that the period
# can be derived from it
df = pd.DataFrame({'data': data}, index=dates).asfreq('D').dropna()

# Reproduce the example for OP
seasonal_decompose(df, model='additive').plot()

[image]

Galilean answered 18/12, 2021 at 14:38 Comment(1)
You say that the period parameter might have been dropped, but it has not been. In the latest version, statsmodels v0.13.2, it is in the list of parameters at statsmodels.org/stable/generated/…. It is an open question, though, how it should be used; see statsmodels seasonal_decompose(): What is the right "period of the series" in the context of a list column (constant vs. varying number of items). - Arteriole
C
0

I recently used the 'Prophet' package for this. It used to be called 'FBProphet', but they dropped the FB (Facebook) part for some reason.

It was a bit harder to install on a Windows PC (you'll need miniconda to install it in that case).

But once it is installed, it is very user friendly, requires just 1 line of code, and works like a charm! It can also decompose seasonality, give n% accuracy graphs, and make forecasts.

It can also easily account for holidays if you want it to; this is pre-built into the package.

There are lots of YouTube videos about this package.

https://github.com/facebook/prophet/tree/main/python

Cagliari answered 27/5, 2022 at 16:57 Comment(0)
M
0
decomposition = sm.tsa.seasonal_decompose(df, model = 'additive', period=7)

If you know the lag at which your data repeats, passing it as period will solve the issue. If you look at the documentation of the seasonal_decompose function, you can see that period defaults to None.

Multifaceted answered 6/9, 2023 at 11:13 Comment(0)
B
-1

To resolve this problem, I ran sort_index, after which the code above worked:

df.sort_index(inplace= True)
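One possible explanation for why sorting helped here: pandas can only infer a frequency from an index that is in order. Whether statsmodels falls back to that inferred frequency depends on the installed version, so treat this as an assumption. The sketch uses hypothetical monthly data stored out of order.

```python
import pandas as pd

# Hypothetical monthly data stored out of order.
idx = pd.to_datetime(["2020-03-01", "2020-01-01", "2020-04-01", "2020-02-01"])
s = pd.Series([3.0, 1.0, 4.0, 2.0], index=idx)

# No frequency can be inferred from an unsorted index.
inferred_unsorted = s.index.inferred_freq

# After sorting, the regular monthly pattern becomes visible.
s_sorted = s.sort_index()
inferred_sorted = s_sorted.index.inferred_freq

# Sorting alone does not set the explicit freq attribute.
explicit_freq = s_sorted.index.freq

print(inferred_unsorted, inferred_sorted, explicit_freq)
```

If sorting is not enough in your case, the data probably has gaps or an irregular spacing, and you need period or asfreq() as described in the other answers.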
Bridgeboard answered 2/2, 2020 at 8:31 Comment(0)

© 2022 - 2025 — McMap. All rights reserved.