Python: pmdarima auto_arima does not work with large data

I have a DataFrame with around 80,000 observations taken every 15 minutes. The seasonal parameter m is assumed to be 96, because the pattern repeats every 24 hours. When I pass this information to my auto_arima call, it runs for a long time (several hours) until the following error message is raised:

MemoryError: Unable to allocate 5.50 GiB for an array with shape (99, 99, 75361) and data type float64

The code that I am using:

from pmdarima import auto_arima

# m=96: the daily pattern repeats every 96 observations at 15-minute resolution
stepwise_fit = auto_arima(df['Hges'], seasonal=True, m=96, stepwise=True,
                          stationary=True, trace=True)
print(stepwise_fit.summary())

I tried resampling to hourly values to reduce the amount of data, lowering the m-factor to 24, but my computer still cannot compute the result.
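
For reference, the resampling attempt looked roughly like this (a sketch assuming a DatetimeIndex and mean aggregation):

df_hourly = df['Hges'].resample('1H').mean()   # 15-minute data -> hourly averages
stepwise_fit = auto_arima(df_hourly, seasonal=True, m=24, stepwise=True,
                          stationary=True, trace=True)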

How do you find the weighting factors with auto_arima when dealing with large data?

Cousteau answered 8/7, 2020 at 9:7 Comment(1)
as the already existing answers say, it seems like too much data for ARIMA. I tried auto_arima with a large dataframe (4,500 values instead of 75,000) and it also crashed. However, by increasing the Windows 10 page file size a lot (to 150 GB, so you need that much free disk space), it was able to handle it. Anyway, I think an LSTM would be a better match for that type of dataframe.Arbitress

I don't recall the exact source where I read this, but neither auto.arima nor pmdarima is really optimized to scale, which might explain the issues you are facing.

But there are some more important things to note about your question: with 80K data points at 15-minute intervals, ARIMA probably isn't the best type of model for your use case anyway:

  • With the frequency and density of your data, it is likely that there are multiple cycles/seasonal patterns, and ARIMA can handle only one seasonal component. So at the very least you should try a model that can handle multiple seasonalities, such as STS or Prophet; see the sketch after this list. (TBATS in R can also handle multiple seasonalities, but it is likely to suffer from the same issues as auto.arima, since it is in the same package.)
  • At 80K points and 15-minute measurement intervals, I assume you are most likely dealing with a "physical" time series that is the output of a sensor or some other metering/monitoring device (electrical load, network traffic, etc.). These types of time series are usually very good use cases for LSTM or other deep-learning-based models instead of ARIMA.
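
A minimal sketch of the Prophet suggestion (my addition, not part of the original answer; it assumes a DataFrame df with a DatetimeIndex, the load column 'Hges' from the question, and the prophet package installed, which older installs import as fbprophet):

from prophet import Prophet

# Prophet expects exactly two columns: 'ds' (timestamps) and 'y' (values)
prophet_df = df['Hges'].reset_index()
prophet_df.columns = ['ds', 'y']

# Fit daily and weekly seasonal components at the same time
model = Prophet(daily_seasonality=True, weekly_seasonality=True)
model.fit(prophet_df)

# Forecast one day ahead at the original 15-minute resolution (96 steps)
future = model.make_future_dataframe(periods=96, freq='15min')
forecast = model.predict(future)
print(forecast[['ds', 'yhat']].tail(96))
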
Philous answered 8/7, 2020 at 22:58 Comment(1)
Thank you for this answer! This helps me a lot. Indeed I am dealing with electrical load and want to forecast one day ahead. I would like to mark your answer as useful, but I can't with a rep of 13. I will do this when my reputation grows :-)Cousteau

pmdarima does not scale well. You should try the AutoARIMA implementation in statsforecast. It is compiled with numba to highly efficient machine code.
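
A minimal sketch of that suggestion (my addition; the exact API differs slightly between statsforecast versions, and the frame below follows its documented long format with a hypothetical series id 'load'):

import pandas as pd
from statsforecast import StatsForecast
from statsforecast.models import AutoARIMA

# statsforecast expects a long-format frame with columns: unique_id, ds, y
sf_df = pd.DataFrame({'unique_id': 'load', 'ds': df.index, 'y': df['Hges'].values})

# season_length=96 matches the daily cycle of 15-minute data
sf = StatsForecast(models=[AutoARIMA(season_length=96)], freq='15min')
sf.fit(sf_df)

forecast = sf.predict(h=96)   # one day ahead at 15-minute resolution
print(forecast.head())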

Fennec answered 4/3, 2022 at 19:0 Comment(1)
Your answer could be improved with additional supporting information. Please edit to add further details, such as citations or documentation, so that others can confirm that your answer is correct. You can find more information on how to write good answers in the help center.Cati
