Interpolation of irregular time series with R
Searching for linear interpolation of time series data in R, I often found recommendations to use na.approx() from the zoo package.

However, with irregular time series I ran into problems: the interpolated values are spaced evenly across the gaps by row position, without taking the associated timestamps into account.

I found a workaround using approxfun(), but I wonder whether there is a cleaner solution, ideally based on tsibble objects with functions from the tidyverts package family?

Previous answers relied on expanding the irregular date grid to a regular grid by filling the gaps. However, this causes problems when the time of day should be taken into account during interpolation.

Here is a (revised) minimal example with POSIXct timestamps rather than dates only:

library(tidyverse)
library(zoo)

df <- tibble(date = as.POSIXct(c("2000-01-01 00:00", "2000-01-02 02:00", "2000-01-05 00:00")),
             value = c(1,NA,2))

df %>% 
  mutate(value_int_wrong = na.approx(value),
         value_int_correct = approxfun(date, value)(date))

# A tibble: 3 x 4
  date                value value_int_wrong value_int_correct
  <dttm>              <dbl>           <dbl>             <dbl>
1 2000-01-01 00:00:00     1             1                1   
2 2000-01-02 02:00:00    NA             1.5              1.27
3 2000-01-05 00:00:00     2             2                2   
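
To make the difference between the two columns explicit, here is a minimal base R sketch that reproduces both numbers by hand. It assumes the timestamps are read as UTC (the example above does not set a time zone), and the variable names are only illustrative.

```r
# Timestamps and values from the example above (tz = "UTC" is an assumption)
t <- as.POSIXct(c("2000-01-01 00:00", "2000-01-02 02:00", "2000-01-05 00:00"),
                tz = "UTC")
v <- c(1, NA, 2)

# Weight by elapsed time: the gap lies 26 h into a 96 h interval
w_time <- as.numeric(difftime(t[2], t[1], units = "hours")) /
  as.numeric(difftime(t[3], t[1], units = "hours"))

# Weight by row position: the gap is the middle of three rows
w_index <- (2 - 1) / (3 - 1)

by_time  <- v[1] + w_time  * (v[3] - v[1])  # time-weighted interpolation
by_index <- v[1] + w_index * (v[3] - v[1])  # position-weighted interpolation

# base::approx() on the numeric timestamps gives the time-weighted result
ok <- !is.na(v)
by_approx <- approx(as.numeric(t[ok]), v[ok], xout = as.numeric(t))$y
```

Here by_index is 1.5 (the value_int_wrong column) and by_time is 1 + 26/96, about 1.27 (the value_int_correct column).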

Any ideas how to deal with this efficiently? Thanks for your support!

Bucentaur answered 7/4, 2020 at 10:58 Comment(1)
Hi Jens, have you found a satisfying solution for your problem yet? I'd be interested. - Tuinenga

Here is an equivalent tsibble-based solution. The interpolate() function needs a model, but you can use a random walk to give linear interpolation between points.

library(tidyverse)
library(tsibble)
library(fable)
#> Loading required package: fabletools

df <- tibble(
  date = as.Date(c("2000-01-01", "2000-01-02", "2000-01-05", "2000-01-06")),
  value = c(1, NA, 2, 1.5)
) %>%
  as_tsibble(index = date) %>%
  fill_gaps()

df %>%
  model(naive = ARIMA(value ~ -1 + pdq(0,1,0) + PDQ(0,0,0))) %>%
  interpolate(df)
#> # A tsibble: 6 x 2 [1D]
#>   date       value
#>   <date>     <dbl>
#> 1 2000-01-01  1   
#> 2 2000-01-02  1.25
#> 3 2000-01-03  1.5 
#> 4 2000-01-04  1.75
#> 5 2000-01-05  2   
#> 6 2000-01-06  1.5

Created on 2020-04-08 by the reprex package (v0.3.0)
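
As a cross-check of the values in the output above, the following base R sketch interpolates linearly between the three observed points on the same daily grid. It only verifies the numbers and is not meant to show how interpolate() works internally; the day offsets and variable names are illustrative.

```r
# Observed points from the example above, as day offsets from 2000-01-01
# (2000-01-02 is NA, so only Jan 1, Jan 5 and Jan 6 carry values)
obs_day   <- c(0, 4, 5)
obs_value <- c(1, 2, 1.5)

# Every day of the gap-filled grid, 2000-01-01 to 2000-01-06
grid_day <- 0:5

# Linear interpolation onto the full daily grid
filled <- approx(obs_day, obs_value, xout = grid_day)$y
filled
```

The result agrees with the value column printed above: 1, 1.25, 1.5, 1.75, 2, 1.5.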

Elman answered 7/4, 2020 at 23:28 Comment(3)
Hi Rob, thank you very much for your answer. I hoped you would take a look! I had to revise my minimal example, because in reality I deal with time series that also resolve the time of day. I tried to run your code on my revised example data set, but this caused an error message ("Could not find an appropriate ARIMA model. This is likely because automatic selection does not select models with characteristic roots that may be numerically unstable."). Can your solution be adapted to POSIXct? Thanks for sharing your expertise! - Kelci
I updated my answer to be more specific in case the POSIXct was confusing it into picking a seasonal model. If it still causes an error, can you please post a bug report with a reproducible example at github.com/tidyverts/fable/issues? - Elman
Hi Rob, thanks again, but it does not seem to run with my minimal example. I opened an issue at github.com/tidyverts/fable/issues/256 - Kelci

Personally, I would go with the solution that you are already using. But to show how na.approx can be used in this case: complete the sequence of dates first, apply na.approx, and then join the result with the original df to keep only the original rows.

library(dplyr)

df %>% 
  tidyr::complete(date = seq(min(date), max(date), by = "day")) %>%
  mutate(value_int = zoo::na.approx(value)) %>%
  right_join(df, by = "date") %>%
  select(date, value_int)


#  date       value_int
#  <date>         <dbl>
#1 2000-01-01      1   
#2 2000-01-02      1.25
#3 2000-01-05      2   
Pimentel answered 7/4, 2020 at 11:5 Comment(2)
Hi Ronak, thanks for your immediate answer. I'm afraid that your proposed solution will not work well when the date vector has a high temporal resolution. I did not cover this in my minimal example, but the environmental time series I'm working with usually have a resolution of seconds, while measurements are taken only every couple of days. - Kelci
Well, it might be inefficient but I think it should still work. - Pimentel
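
To put a rough number on the efficiency concern, this base R sketch compares the size of a regular grid at daily and at one-second resolution for the time span of the revised example in the question. It assumes UTC timestamps, and the variable names are illustrative.

```r
# Time span of the revised example in the question (tz = "UTC" is an assumption)
start <- as.POSIXct("2000-01-01 00:00", tz = "UTC")
end   <- as.POSIXct("2000-01-05 00:00", tz = "UTC")
gap   <- as.POSIXct("2000-01-02 02:00", tz = "UTC")

# Number of rows in the original irregular series
n_obs <- 3

# Rows needed for a regular daily grid over the same span
daily_grid  <- seq(start, end, by = "day")
n_grid_days <- length(daily_grid)

# Rows needed for a regular one-second grid, computed without building it
n_grid_secs <- as.numeric(difftime(end, start, units = "secs")) + 1

# A daily grid starting at midnight does not contain the 02:00 timestamp
on_daily_grid <- as.numeric(gap) %in% as.numeric(daily_grid)
```

For this span the daily grid has 5 rows, but it misses the 02:00 observation (on_daily_grid is FALSE). A one-second grid would contain it, at the cost of 345601 rows for 3 observations.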
