I want to check the stationarity of a time series saved in TS.csv.
However, R's tseries::adf.test() and Python's statsmodels.tsa.stattools.adfuller() give completely different results: adf.test() indicates the series is stationary (p < 0.05), while adfuller() indicates it is non-stationary (p > 0.05).
Is there a problem in the following code?
What is the right process for testing the stationarity of a time series in R and Python?
Thanks.
R code:
> rd <- read.table('Data/TS.csv', sep = ',', header = TRUE)
> inp <- ts(rd$Sales, frequency = 12, start = c(1965, 1))
> inp
Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec
1965 154 96 73 49 36 59 95 169 210 278 298 245
1966 200 118 90 79 78 91 167 169 289 347 375 203
1967 223 104 107 85 75 99 135 211 335 460 488 326
1968 346 261 224 141 148 145 223 272 445 560 612 467
1969 518 404 300 210 196 186 247 343 464 680 711 610
1970 613 392 273 322 189 257 324 404 677 858 895 664
1971 628 308 324 248 272
> library(tseries)
> adf.test(inp)
Augmented Dickey-Fuller Test
data: inp
Dickey-Fuller = -7.2564, Lag order = 4, p-value = 0.01
alternative hypothesis: stationary
Python code (from Time_Series.ipynb):
import pandas as pd
from statsmodels.tsa.stattools import adfuller
df = pd.read_csv('Data/TS.csv')
ts = pd.Series(list(df['Sales']), index=pd.to_datetime(df['Month'],format='%Y-%m'))
s_test = adfuller(ts, autolag='AIC')
print("p value > 0.05 means data is non-stationary: ", s_test[1])
# output: p value > 0.05 means data is non-stationary: 0.988889420517
Update
@gfgm gave an excellent explanation of why the R and Python results differ, and of how to make them agree by changing the parameters.
For the second question above, "What's the right process to test stationary of a time series in R and Python?", I'd like to provide some details:
When forecasting a time series, an ARIMA model needs the input series to be stationary.
If the input isn't stationary, it should be transformed with log() or diff() to make it stationary, and then fitted to the model.
So the problem is:
should I treat the input as stationary (as R's default parameters suggest) and fit it directly to the ARIMA model,
or treat it as non-stationary (as Python's default parameters suggest) and first make it stationary with extra functions (like log() or diff())?
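For reference, the two transformations mentioned above can be sketched in plain Python as follows. This is only an illustration on the first year of the series; the helper names log_transform and difference are made up for this example and are not part of any library. It does not settle which test result to trust, it just shows what would be applied before re-running the test.

```python
import math

# First year of the series (Jan-Dec 1965), copied from the R printout above.
sales_1965 = [154, 96, 73, 49, 36, 59, 95, 169, 210, 278, 298, 245]


def log_transform(values):
    """Natural log of each value; damps variance that grows with the level."""
    return [math.log(v) for v in values]


def difference(values, lag=1):
    """Lagged differences x[t] - x[t - lag], analogous to R's diff(x, lag)."""
    return [values[i] - values[i - lag] for i in range(lag, len(values))]


log_sales = log_transform(sales_1965)
log_diff = difference(log_sales)

# Differencing drops the first `lag` observations.
print(len(sales_1965), len(log_diff))
print([round(d, 3) for d in log_diff[:3]])
```

Whichever transformation is used, the stationarity test would then be run again on the transformed series. On the full monthly series, lag=12 would give seasonal differences instead of month-to-month ones.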
nlag = floor(4*(length(x)/100)^(2/9)) is 3 instead of 4. And the test statistic is -8.0345, compared with the R version: adf.test(inp) # result: -7.2564. – Salerno
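The arithmetic in this comment can be checked directly. The sketch below evaluates the quoted rule and the default used by tseries::adf.test(), trunc((n - 1)^(1/3)), for the 77 observations printed in the question; the variable names are only illustrative.

```python
import math

n = 77  # number of monthly observations printed above (Jan 1965 to May 1971)

# Rule quoted in the comment: floor(4 * (n / 100)^(2/9))
lag_comment_rule = math.floor(4 * (n / 100) ** (2 / 9))

# Default in tseries::adf.test(): trunc((n - 1)^(1/3))
lag_tseries_default = math.trunc((n - 1) ** (1 / 3))

print(lag_comment_rule)     # 3
print(lag_tseries_default)  # 4
```

So the two rules disagree by one lag for this series, which is consistent with the different test statistics quoted in the comment.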